|
I have a 4-node MapR cluster, and each node has 12 map slots and 8 reduce slots. I'm trying to tune TeraSort on it, and when I set the chunk size to 8 GB the TeraSort job fails with the error below. I want to increase the block size so that the total number of map tasks for TeraSort equals the number of map slots in the cluster. I've also set the chunk size of the directory to other values (11 GB, 16 GB, etc.) and I get the same error. The TeraGen job is configured to write 48 files.

12/02/28 16:32:25 INFO terasort.TeraSort: starting
12/02/28 16:32:25 INFO mapred.FileInputFormat: Total input paths to process : 32
java.lang.IllegalArgumentException: Offset 0 is outside of file (0..-1)
    at org.apache.hadoop.mapred.FileInputFormat.getBlockIndex(FileInputFormat.java:305)
    at org.apache.hadoop.mapred.FileInputFormat.getSplitHosts(FileInputFormat.java:461)
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:266)
    at org.apache.hadoop.examples.terasort.TeraInputFormat.getSplits(TeraInputFormat.java:209)
    at org.apache.hadoop.examples.terasort.TeraInputFormat.writePartitionFile(TeraInputFormat.java:116)
    at org.apache.hadoop.examples.terasort.TeraSort.run(TeraSort.java:243)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.hadoop.examples.terasort.TeraSort.main(TeraSort.java:257)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
    at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
    at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:74)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:186)

Thanks |
|
I tracked this issue down to the fact that the number of records (100-byte TeraSort records), the chunk size, and the number of map slots have to be calibrated carefully. I finally managed to get TeraSort to run with the following values for these parameters:

num records : 10307921511
chunksize : 21474902016
num map slots (cluster wide) : 48

Total data sorted = 10307921511 * 100 bytes ~= 960 GiB (about 1.03 TB)

This is a pretty extreme chunk size. It is highly unlikely to yield any performance benefit. I know that Hadoop is sometimes used with crazy big block sizes, but that is usually to work around the existence of the namenode. MapR doesn't have a namenode, so the need for the workaround goes away. All you should need is chunks large enough that the cost of spawning a new mapper is masked by the time spent processing each chunk.
(29 Feb '12, 16:53)
TedDunning ♦♦
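For anyone trying to reproduce the calibration in the answer above, here is a rough sketch of the arithmetic that appears to be behind those numbers. It assumes TeraGen spreads the records evenly over the output files and that the chunk size has to be a multiple of 64 KiB; both are assumptions on my part, not something stated in the thread, and the function names are made up for illustration. Under those assumptions, the smallest chunk size that holds one whole file comes out to exactly the chunksize value quoted in the answer.

```python
RECORD_BYTES = 100        # TeraSort record size, as stated in the answer
CHUNK_ALIGN = 64 * 1024   # ASSUMPTION: chunk size must be a multiple of 64 KiB


def per_file_bytes(num_records: int, num_files: int) -> int:
    """Bytes per file if the data is split evenly (rounded up).

    ASSUMPTION: TeraGen writes roughly equal-sized files.
    """
    total = num_records * RECORD_BYTES
    return -(-total // num_files)  # ceiling division


def smallest_chunk_size(num_records: int, num_files: int) -> int:
    """Smallest aligned chunk size that holds one whole file (hypothetical helper)."""
    need = per_file_bytes(num_records, num_files)
    return -(-need // CHUNK_ALIGN) * CHUNK_ALIGN


if __name__ == "__main__":
    records, files = 10307921511, 48
    print("total bytes   :", records * RECORD_BYTES)
    print("bytes per file:", per_file_bytes(records, files))
    print("chunk size    :", smallest_chunk_size(records, files))
```

With 48 files, each one is a shade over 20 GiB, so one file per chunk means one map task per map slot. This only explains where the numbers come from; it says nothing about whether a chunk this large is a good idea, which is the point of the comment above.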
|