|
We are seeing lots of messages like this: org.apache.hadoop.hdfs.StateChange - DIR* NameSystem.startFile: failed to create file. We were testing NN recovery from the secondary NN, and it seems to take forever. Even when recovery works, HBase has problems like this afterward. What is happening? |
|
What is happening is that the Hadoop name node was not designed with HA operation in mind. As a result, recovery from the secondary name node takes a long time if you have many files, because the recovering name node has to load the entire namespace image into memory and then wait for block reports from the data nodes. There are also many corner cases that can cause problems during this recovery process. These issues are really hard to fix; see the Avatar Node work at Facebook, for instance. It is much easier to use MapR in the first place. Then all of these recovery processes just go away because the cluster is designed from the ground up with high availability in mind. |