|
On several occasions I have seen jobs fail with file system errors such as "Stale File handle". The job thread dumps contain entries like these:

ERROR Client fs/client/fileclient/cc/client.cc:1515 Thread: 140316168423168 AllocateFid failed, File output.00242, error Stale File handle(116), primaryFid 2112.1398320.10590244
ERROR Client fs/client/fileclient/cc/writebuf.cc:229 Thread: 140316168423168 FlushWrite failed: File output.00242, error: Stale File handle(116), pfid 2112.1398320.10590244, off 2162688 6449.3374.201984
ERROR Client fs/client/fileclient/cc/client.cc:1515 Thread: 140317051700992 AllocateFid failed, File output.00242, error Stale File handle(116), primaryFid 2112.1398320.10590244

or

ERROR Client fs/client/fileclient/cc/client.cc:489 Thread: 140717038753536 Open failed for file /var/mapr/local/node/mapred/taskTracker/spill/, LookupFid error No such file or directory(2)

What could cause such failures, and how can they be resolved? |
|
This normally happens when a task fails to respond within the expected time limit and is killed. The temporary files for the task attempt are cleaned up, but the process might linger for a few more seconds. If the process tries to read any of the cleaned-up files during that time, these errors show up in the logs. In short, these errors are not a symptom of the actual issue; they are an after-effect. The task tracker fails for its own set of reasons. You can confirm this by comparing the timestamps of these errors with the entries that precede them by a few seconds in the tasktracker logs.

Indeed, that was the reason. I figured it out a few days ago :-)
(17 Feb, 06:15)
Vinod Singh
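The timestamp check described in the answer above could be sketched roughly as follows. This is only an illustration: the function name `errors_after_kill` and the 30-second window are made up, and parsing timestamps out of the tasktracker and client logs is left to the reader because the exact log format is not shown in this thread.

```python
from datetime import datetime, timedelta


def errors_after_kill(kill_times, error_times, window_seconds=30):
    """Pair each error with a task kill that preceded it by at most window_seconds.

    kill_times and error_times are lists of datetime objects that the caller
    has already parsed from the logs (hypothetical inputs). The window length
    is an assumption, not a documented value.
    """
    window = timedelta(seconds=window_seconds)
    pairs = []
    for err in error_times:
        for kill in kill_times:
            # The error must come at or after the kill, and within the window.
            if timedelta(0) <= err - kill <= window:
                pairs.append((kill, err))
                break
    return pairs
```

Errors that pair up with a preceding kill are likely the after-effect described above; errors with no matching kill deserve a closer look.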
|
|
Running /opt/mapr/bin/maprcli dump containerinfo -ids 2112 -json will list the active fileserver nodes on the cluster where the file might be stored (2112 is the first component of the primaryFid in the errors above). Once you have identified the nodes, look at the mfs logs on them. That should provide more information about the underlying issue. |
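To avoid copying IDs by hand, a small script can pull them out of the error lines and build the command shown above. This is a minimal sketch that assumes the first dot-separated field of the FID is the container ID, which is what the 2112 in the command suggests; the helper names `container_ids` and `containerinfo_command` are made up for illustration.

```python
import re

# Matches FIDs as they appear in the errors, e.g.
# "primaryFid 2112.1398320.10590244" or "pfid 2112.1398320.10590244".
# Assumption: the first field is the container ID.
FID_PATTERN = re.compile(r"\b(?:primaryFid|pfid)\s+(\d+)\.\d+\.\d+")


def container_ids(log_text):
    """Return the distinct container IDs found in log_text, in order of first appearance."""
    seen = []
    for match in FID_PATTERN.finditer(log_text):
        cid = match.group(1)
        if cid not in seen:
            seen.append(cid)
    return seen


def containerinfo_command(cid):
    """Build the maprcli invocation from the answer above as an argument list."""
    return ["/opt/mapr/bin/maprcli", "dump", "containerinfo", "-ids", cid, "-json"]


if __name__ == "__main__":
    sample = (
        "ERROR Client fs/client/fileclient/cc/client.cc:1515 Thread: 140316168423168 "
        "AllocateFid failed, File output.00242, error Stale File handle(116), "
        "primaryFid 2112.1398320.10590244"
    )
    for cid in container_ids(sample):
        # Only print the command; run it yourself on a cluster node.
        print(" ".join(containerinfo_command(cid)))
```

The script only prints the command rather than executing it, so it can be tried on any machine with a copy of the logs.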