I have a 4-node MapR cluster; each node has 12 map slots and 8 reduce slots. I'm trying to tune TeraSort on this cluster, and when I set the chunk size to 8 GB, the TeraSort job fails with the error shown below.

I want to increase the block size so that the total number of map tasks for TeraSort equals the number of map slots in the cluster. I've also tried other chunk sizes for the directory (11 GB, 16 GB, etc.) and get the same error. The TeraGen job is configured to write 48 files.


12/02/28 16:32:25 INFO terasort.TeraSort: starting
12/02/28 16:32:25 INFO mapred.FileInputFormat: Total input paths to process : 32
java.lang.IllegalArgumentException: Offset 0 is outside of file (0..-1)
    at org.apache.hadoop.mapred.FileInputFormat.getBlockIndex(FileInputFormat.java:305)
    at org.apache.hadoop.mapred.FileInputFormat.getSplitHosts(FileInputFormat.java:461)
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:266)
    at org.apache.hadoop.examples.terasort.TeraInputFormat.getSplits(TeraInputFormat.java:209)
    at org.apache.hadoop.examples.terasort.TeraInputFormat.writePartitionFile(TeraInputFormat.java:116)
    at org.apache.hadoop.examples.terasort.TeraSort.run(TeraSort.java:243)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.hadoop.examples.terasort.TeraSort.main(TeraSort.java:257)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
    at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
    at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:74)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:186)

Thanks

asked 28 Feb '12, 17:06

Kshitij S

edited 28 Feb '12, 17:22


Kshitij,

Could you please paste the output of the following command?

hadoop mfs -lsr "input directory for terasort"

answered 29 Feb '12, 15:18

amit ♦

I tracked this issue down to the fact that the number of records (100-byte TeraSort records), the chunk size, and the number of map slots have to be calibrated carefully. I finally got TeraSort to run with the following values for these parameters:

num records : 10307921511
chunksize : 21474902016
num map slots (cluster wide) : 48

Total data sorted = 10307921511*100 Bytes ~= 960GB
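
For anyone trying to reproduce the calibration, here is a minimal sketch of the arithmetic that seems to connect these three numbers. It is my reading of the values, not something confirmed in the thread. It assumes the records are spread evenly over the 48 TeraGen files mentioned in the question, that each file should fit in exactly one chunk (so one map task per file), and that the chunk size has to be a multiple of 64 KB. The function name and the `CHUNK_GRANULARITY` constant are made up for illustration.

```python
RECORD_BYTES = 100              # size of one TeraSort record (from the thread)
CHUNK_GRANULARITY = 64 * 1024   # ASSUMPTION: chunk size must be a multiple of 64 KB


def chunk_size_for_one_chunk_per_file(num_records: int, num_files: int) -> int:
    """Smallest chunk size (bytes) that holds a whole input file in one chunk.

    Hypothetical helper: assumes records are split evenly across files and
    that the largest file holds ceil(num_records / num_files) records.
    """
    # Integer ceiling division avoids floating-point rounding on large counts.
    records_per_file = -(-num_records // num_files)
    bytes_per_file = records_per_file * RECORD_BYTES
    # Round up to the next multiple of the assumed chunk granularity.
    chunks_of_granularity = -(-bytes_per_file // CHUNK_GRANULARITY)
    return chunks_of_granularity * CHUNK_GRANULARITY


if __name__ == "__main__":
    num_records = 10307921511
    num_map_slots = 48          # 4 nodes * 12 map slots, one file per slot
    print("total bytes:", num_records * RECORD_BYTES)
    print("chunk size :", chunk_size_for_one_chunk_per_file(num_records, num_map_slots))
```

With the values above, the computed chunk size comes out to the same 21474902016 bytes, which is 20 GB plus one extra 64 KB unit.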

answered 29 Feb '12, 16:08

Kshitij S

This is a pretty extreme chunk size. It is highly unlikely to yield any performance benefit.

I know that Hadoop is sometimes used with very large block sizes, but that is usually to work around the limits of the namenode. MapR doesn't have a namenode, so the need for the workaround goes away. All you should need is chunks large enough that the cost of spawning a new mapper is masked by the time spent processing each chunk.

(29 Feb '12, 16:53) TedDunning ♦♦