|
I have been trying a few Pig scripts that work fine on the Cloudera distribution of Hadoop. I can create a new Zebra store as follows: STORE C INTO '/user/abc/c' USING org.apache.hadoop.zebra.pig.TableStorer(''); The store gets created, but when I try to read it back with: C = LOAD '/user/abc/c' USING org.apache.hadoop.zebra.pig.TableLoader('', 'sorted'); I get an error. It's interesting that the same script (running the same version of Pig/Zebra) behaves differently on the two clusters. Has anyone else run into issues with Pig/Zebra on MapR? Thanks |
|
MapR's MapReduce implementation has an optimization that avoids launching setup and cleanup tasks: it reads the job conf to figure out which OutputCommitter to use. This was consistent with the old 'mapred' APIs. In the new 'mapreduce' APIs, which Pig uses, the OutputCommitter is determined dynamically from the OutputFormat, so the cleanup step was being skipped on MapR (since we read the committer from the conf). As a result, the .meta file was not created and the _temporary dir was still present in the output directory. We will have a fix for this in the next release. As a workaround for now, before launching the pig shell, set PIG_OPTS=-Dmapred.output.committer.class=org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputCommitter, or set this property in your job configuration. This will invoke the correct cleanup steps and you will have a happy pig with a happy zebra. |
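A minimal sketch of the PIG_OPTS workaround as shell commands, assuming a Bourne-style shell and that the pig launcher reads PIG_OPTS from the environment. The PIG_COMMITTER variable is only a helper name introduced here for readability.

```shell
# Hypothetical helper variable holding the committer class named in the answer.
PIG_COMMITTER=org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputCommitter

# Append the property to PIG_OPTS, keeping any options that are already set.
# Note: no spaces around '=' in a shell assignment.
export PIG_OPTS="${PIG_OPTS:+$PIG_OPTS }-Dmapred.output.committer.class=$PIG_COMMITTER"

# Show the value that the pig launcher will see.
echo "PIG_OPTS=$PIG_OPTS"
```

After this, start pig from the same shell session so it inherits the exported variable.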
|
Could you share a small snippet of the data you are using, or a hint about it, so that I can reproduce the error and figure out which issue you might be hitting? Also, can you paste the entire directory structure under /user/abc/c that gets generated after the store command? |
|
Sure. Here's a small example that reproduces the problem. My file in.txt contains 3 lines (1t1, 2t2, and 3t3).
On a MapR cluster the dump fails and the store output $path/out looks as follows:
On a Cloudera CDH3U1 cluster the dump succeeds and the store output is as follows. Note the addition of the .meta file. It's also interesting that the _temporary files are not cleaned up on the MapR cluster. The only clue I could find is that BasicTableOutputFormat.close may somehow not be getting called (http://mail-archives.apache.org/mod_mbox/hadoop-pig-user/201007.mbox/%3C6DED586E81D4104CB7F9C870E174A09704749DDD@SNV-EXVS06.ds.corp.yahoo.com%3E)
Note that I am also running pig-0.9 and setting the Hadoop version with PIG_HADOOP_VERSION. It's quite possible that the problem I am seeing is with pig-0.9 and how it is picking up the hadoop jar. I also suspect that it may have to do with how pig is picking up the hadoop configuration. That can be as much a source of trouble as getting the wrong jar.
(19 Aug '11, 22:41)
TedDunning ♦♦
|
|
I don't think the problem is with pig picking up the wrong version of hadoop. I did some debugging and it looks like BasicTableOutputFormat.close is not being called. That call is what finally creates the .meta file, which indicates that the Column Group is no longer being written to. I am debugging further to nail down the issue. Will keep you posted. |
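The two symptoms discussed in this thread (a missing .meta file and a leftover _temporary dir) can be checked mechanically. The sketch below is a hypothetical helper, not part of Pig or Zebra: check_zebra_listing reads a recursive directory listing on stdin and reports either symptom. It assumes each listing line ends with the file path, as the output of `hadoop fs -lsr` does.

```shell
# Hypothetical helper: inspects a recursive listing (one entry per line, path last)
# and returns non-zero if the store output looks incomplete.
check_zebra_listing() {
  listing=$(cat)   # read the whole listing from stdin
  rc=0
  # A leftover _temporary entry suggests the cleanup step never ran.
  if printf '%s\n' "$listing" | grep -q '_temporary'; then
    echo "leftover _temporary entries found: cleanup did not run"
    rc=1
  fi
  # No path ending in .meta suggests the column group was never closed.
  if ! printf '%s\n' "$listing" | grep -q '\.meta$'; then
    echo "no .meta file found: column group was not closed"
    rc=1
  fi
  return $rc
}

# Usage (assumes the hadoop CLI is on PATH; path taken from the question):
#   hadoop fs -lsr /user/abc/c | check_zebra_listing
```

Reading the listing from stdin keeps the check independent of how the listing is produced, so the same function works against a saved listing pasted from either cluster.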
|
I set PIG_OPTS as you suggested (export PIG_OPTS=-Dmapred.output.committer.class=org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputCommitter) and the .meta file is now created successfully. Thanks for your help! |