We are seeing lots of messages like this:

org.apache.hadoop.hdfs.StateChange - DIR* NameSystem.startFile: failed to create file

We were testing NameNode (NN) recovery from the secondary NN, and it seems to take forever. Even after recovery succeeds, HBase has problems like this.

What is happening?

asked 22 Jun '11, 09:06

FAQ ♦


The Hadoop NameNode was not designed with high-availability (HA) operation in mind. As a result, recovery from the secondary NameNode takes a long time if you have many files, and there are many corner cases that can cause problems during the recovery process.

These issues are really hard to fix; see the AvatarNode work at Facebook, for instance. It is much easier to use MapR in the first place. Then all of these recovery processes just go away, because the cluster is designed from the ground up with high availability in mind.

answered 23 Jun '11, 15:56

Ted Dunning ♦♦
