On several occasions I have seen jobs fail because of file system errors such as Stale File handle. The job thread dumps contain entries like the following:

ERROR Client fs/client/fileclient/cc/client.cc:1515 Thread: 140316168423168 AllocateFid failed, File output.00242, error Stale File handle(116), primaryFid 2112.1398320.10590244
ERROR Client fs/client/fileclient/cc/writebuf.cc:229 Thread: 140316168423168 FlushWrite failed: File output.00242, error: Stale File handle(116), pfid 2112.1398320.10590244, off 2162688 6449.3374.201984
ERROR Client fs/client/fileclient/cc/client.cc:1515 Thread: 140317051700992 AllocateFid failed, File output.00242, error Stale File handle(116), primaryFid 2112.1398320.10590244
or
ERROR Client fs/client/fileclient/cc/client.cc:489 Thread: 140717038753536 Open failed for file /var/mapr/local/node/mapred/taskTracker/spill/, LookupFid error No such file or directory(2)

What could cause these failures, and how can they be resolved?

asked 06 Feb, 22:13


Vinod Singh


This normally happens when a task fails to respond within the expected time limit and is killed. The temporary files for the task attempt are cleaned up, but the process might linger for a few more seconds. If the process attempts to read any of the cleaned-up files during that time, these errors show up in the logs.

In short, these errors are not a symptom of the actual issue; they are an after-effect. The task tracker fails for its own set of reasons. Comparing the timestamps of these errors with the entries that precede them by a few seconds in the tasktracker logs can confirm this.
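To make the timestamp comparison concrete, here is a minimal sketch. It assumes you have already parsed both logs into (timestamp, message) pairs yourself; the helper name `preceding_entries`, the 30-second window, and the sample entries are all hypothetical and not taken from any MapR tool or real log.

```python
from datetime import datetime, timedelta

def preceding_entries(error_time, tracker_entries, window_seconds=30):
    """Return tasktracker entries logged within window_seconds before error_time.

    tracker_entries is an iterable of (datetime, message) pairs.
    The default window is an arbitrary assumption; adjust it as needed.
    """
    start = error_time - timedelta(seconds=window_seconds)
    return [(ts, msg) for ts, msg in tracker_entries if start <= ts <= error_time]

# Hypothetical, already-parsed tasktracker entries (placeholder messages).
tracker = [
    (datetime(2013, 2, 6, 22, 10, 0), "entry A"),
    (datetime(2013, 2, 6, 22, 12, 50), "entry B"),
    (datetime(2013, 2, 6, 22, 13, 30), "entry C"),
]

# Hypothetical time at which a Stale File handle error was logged.
error_at = datetime(2013, 2, 6, 22, 13, 0)

for ts, msg in preceding_entries(error_at, tracker):
    print(ts.isoformat(), msg)
```

If the entries returned for each error show the task attempt being killed or cleaned up, that would support the explanation above.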


answered 17 Feb, 05:35


Nabeel ♦♦

Indeed, that was the reason. I figured it out a few days ago :-)

(17 Feb, 06:15) Vinod Singh

/opt/mapr/bin/maprcli dump containerinfo -ids 2112 -json

will give you the list of active fileserver nodes on the cluster where the file might be stored. Once you have identified the nodes, have a look at the mfs logs on those nodes. That should provide more information about what the issue is.
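If many such errors are logged, the command can be generated from the log lines themselves. The sketch below assumes that the first dotted component of the primaryFid/pfid is the container id, which is what the 2112 in the command above appears to correspond to. The function names are hypothetical; only the maprcli invocation comes from this answer.

```python
import re

# Matches "primaryFid 2112.1398320.10590244" or "pfid 2112.1398320.10590244".
FID_RE = re.compile(r"\b(?:primaryFid|pfid)\s+(\d+)\.(\d+)\.(\d+)")

def container_ids(log_lines):
    """Return the distinct leading fid components, in first-seen order.

    Assumption: the leading component of the fid is the container id.
    """
    seen = []
    for line in log_lines:
        match = FID_RE.search(line)
        if match and match.group(1) not in seen:
            seen.append(match.group(1))
    return seen

def containerinfo_command(cid):
    """Build the maprcli command from this answer for one container id."""
    return f"/opt/mapr/bin/maprcli dump containerinfo -ids {cid} -json"

# One of the error lines quoted in the question.
sample = [
    "ERROR Client fs/client/fileclient/cc/client.cc:1515 Thread: 140316168423168 "
    "AllocateFid failed, File output.00242, error Stale File handle(116), "
    "primaryFid 2112.1398320.10590244",
]

for cid in container_ids(sample):
    print(containerinfo_command(cid))
```

Lines without a fid, such as the "Open failed" error in the question, are skipped.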


answered 07 Feb, 01:38


Nabeel ♦♦


Last updated: 17 Feb, 06:15
