I have been trying a few Pig scripts that have been working fine on the Cloudera distribution of Hadoop. I can create a new zebra store as follows:

STORE C INTO '/user/abc/c' USING org.apache.hadoop.zebra.pig.TableStorer('');

It gets created; however, when I try to read it back using the following:

C = LOAD '/user/abc/c' USING org.apache.hadoop.zebra.pig.TableLoader('', 'sorted');

I get the following error.
Caused by: java.io.IOException: BasicTable.Reader constructor failed : Missing Meta File of maprfs://10.18.1.9:7222/user/abc/c/CG0/.meta
    at org.apache.hadoop.zebra.io.BasicTable$Reader.<init>(BasicTable.java:325)
    at org.apache.hadoop.zebra.io.BasicTable$Reader.<init>(BasicTable.java:289)
    at org.apache.hadoop.zebra.mapreduce.TableInputFormat.setSplitMode(TableInputFormat.java:466)
    at org.apache.hadoop.zebra.pig.TableLoader.setSortOrder(TableLoader.java:198)
    at org.apache.hadoop.zebra.pig.TableLoader.getSchema(TableLoader.java:418)
    at org.apache.pig.newplan.logical.relational.LOLoad.getSchemaFromMetaData(LOLoad.java:150)
    ... 20 more

It's interesting that the same script (running the same version of Pig/Zebra) behaves differently on the two distributions.

I was curious if anyone else has run into any issues with Pig/Zebra on MapR.

Thanks

asked 19 Aug '11, 15:26

Andy Sautins


MapR's MapReduce applies an optimization to avoid launching setup and cleanup tasks: it reads the job conf to figure out which OutputCommitter to use. This behavior is consistent with the old 'mapred' APIs.

In the new 'mapreduce' APIs, which Pig uses, the OutputCommitter is determined dynamically from the OutputFormat, so the cleanup step was being skipped on MapR (since we read the committer from the conf). As a result, the .meta file was not being created and the _temporary dir was still present in the output directory.

We will have a fix for this in the next release. As a workaround for now, before launching the Pig shell, set

export PIG_OPTS=-Dmapred.output.committer.class=org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputCommitter

or set mapred.output.committer.class to the same value in your job configuration. This will invoke the correct cleanup steps and you will have a happy pig with a happy zebra.
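For a bash-style shell, the workaround above could be applied roughly as in the following sketch. It assumes you want to keep any options already present in PIG_OPTS; COMMITTER is only a local helper variable introduced here for readability, not something Pig itself reads.

```shell
# Sketch of the workaround for a bash-style shell.
# COMMITTER is a helper variable for this example only.
COMMITTER=org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputCommitter

# Append the override to whatever is already in PIG_OPTS, if anything.
# Note: no spaces around '=' in a shell assignment.
export PIG_OPTS="${PIG_OPTS:+$PIG_OPTS }-Dmapred.output.committer.class=$COMMITTER"

# Show the options Pig will be started with from this shell.
echo "$PIG_OPTS"
```

Pig has to be launched from the same shell session afterwards for the setting to take effect.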

answered 22 Aug '11, 13:15

richa

edited 22 Aug '11, 13:33

TedDunning ♦♦

To reproduce the error you are seeing, can I get a small snippet or hint on the data you are using to figure out what issue you might be hitting?

Also, can you paste the entire directory structure under /user/abc/c that gets generated after the store command?

answered 19 Aug '11, 16:33

richa

Sure,

Here's a small example that reproduces the problem. My file in.txt contains 3 tab-separated lines (1\t1, 2\t2, and 3\t3).
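An input file like that could be created as in this sketch, which assumes the two values on each line are separated by a tab (the default delimiter for PigStorage()). Copying the file to the cluster is left as a comment because the target path depends on your setup.

```shell
# Write three tab-separated rows to a local in.txt.
printf '1\t1\n2\t2\n3\t3\n' > in.txt

# Then copy it to the cluster, for example:
#   hadoop fs -put in.txt "$path/in.txt"
# ($path is the same parameter the Pig script uses.)
```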

REGISTER $lib/zebra-0.8.0-dev.jar

A = LOAD '$path/in.txt' USING PigStorage() as (c1:int, c2:int);

STORE A INTO '$path/out' USING org.apache.hadoop.zebra.pig.TableStorer('');

B = LOAD '$path/out' USING org.apache.hadoop.zebra.pig.TableLoader('');

DUMP B;

On a MapR cluster the dump fails and the store output $path/out looks as follows:

out

out/_temporary

out/_temporary/CG0

out/.btschema

out/CG0

out/CG0/part-0

out/CG0/.schema

On a Cloudera CDH3U1 cluster the dump succeeds and the store output is as shown below. Note the addition of the .meta file. It's also interesting that the _temporary files are not cleaned up on the MapR cluster. The only clue I could find is that BasicTableOutputFormat.close may somehow not be getting called (http://mail-archives.apache.org/mod_mbox/hadoop-pig-user/201007.mbox/%3C6DED586E81D4104CB7F9C870E174A09704749DDD@SNV-EXVS06.ds.corp.yahoo.com%3E)

out

out/CG0

out/CG0/.meta

out/CG0/part-0

out/CG0/.schema

out/.btschema

Note I am also running pig-0.9 and setting hadoop version with PIG_HADOOP_VERSION. It's quite possible that the problem I am seeing is with pig-0.9 and how it is picking up the hadoop jar.
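The difference between the two listings above can be checked mechanically. The following is a rough sketch: check_zebra_listing is a hypothetical helper written for this thread, not part of Pig, Zebra, or Hadoop. It reads one path per line and treats the store as complete only when some column group has a .meta file and no _temporary directory is left behind.

```shell
# check_zebra_listing (hypothetical helper): reads one path per line on stdin.
# Prints "complete" if a CG<n>/.meta file is present and no _temporary
# directory remains; otherwise prints "incomplete" and returns 1.
check_zebra_listing() {
  listing=$(cat)
  has_meta=no
  has_temp=no
  if printf '%s\n' "$listing" | grep -Eq '(^|/)CG[0-9]+/\.meta$'; then
    has_meta=yes
  fi
  if printf '%s\n' "$listing" | grep -Eq '(^|/)_temporary(/|$)'; then
    has_temp=yes
  fi
  if [ "$has_meta" = yes ] && [ "$has_temp" = no ]; then
    echo complete
  else
    echo incomplete
    return 1
  fi
}
```

On a cluster, the listing could come from something like `hadoop fs -lsr "$path/out" | awk '{print $NF}' | check_zebra_listing`, assuming the path is the last column of the listing output and contains no spaces.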

answered 19 Aug '11, 21:10

Andy Sautins

I also suspect that it may have to do with how pig is picking up the hadoop configuration. That can be as much a source of trouble as getting the wrong jar.

(19 Aug '11, 22:41) TedDunning ♦♦

I don't think the problem is with Pig picking up the wrong version of Hadoop. I did some debugging and it looks like BasicTableOutputFormat.close is not being called. That call is what finally creates the .meta file, which indicates that the Column Group is no longer being written to.

I am debugging further to nail the issue. Will keep you posted.

answered 20 Aug '11, 15:13

richa

I set PIG_OPTS as you suggested (export PIG_OPTS=-Dmapred.output.committer.class=org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputCommitter) and the .meta file is created successfully.

Thanks for your help!

answered 22 Aug '11, 16:02

Andy Sautins
