Hi,

We're comparing an application currently running on Cloudera with MapR. Calls to the globStatus function on a FileSystem object take 5+ seconds to return on MapR, while on Cloudera they return in under a second.

Everything else seems to work okay. An example of the path being globbed:

"/offlined/5_10016/2012_04/3998447_41956861_*"

The directory has 5300 files in it and the glob will match 6.

With 'hadoop fs -ls ...', a listing takes 6-7 seconds, both with the * pattern and for the plain directory.

Any idea what is wrong? The main reason we are considering MapR is that we hit NameNode limits on Cloudera with 11MM+ files across 9000+ directories.

Thanks,

Chris

asked 27 Apr '12, 06:03

ChrisCurtin

Going directly to the /mapr mount point and using Linux 'ls' the performance is much better: 2s for the full directory, 0.16 s for the partial match (3998447_41956861_*).

We have not recompiled against any MapR-specific libraries (nor do we include the Cloudera ones in the JAR). Do we need to?

(27 Apr '12, 06:12) ChrisCurtin

Firstly, you shouldn't compile against any MapR-specific libraries.

Secondly, w.r.t. the speed problem: MapR keeps the filenames on disk in no particular order, while CDH holds them in memory in a Java map (i.e., sorted). Hence the 10x performance difference when sorting very large directories alphabetically. We should try to fix it by sending the filter "3998447_41956861_*" over to the file server and returning a much smaller set for the Hadoop shell to sort. I will open a bug for it.

Over NFS, Linux's "ls" will fetch the entire directory and sort it (MapR never sees the globbing). That situation is not possible to fix.

(27 Apr '12, 07:13) MC Srivas ♦♦
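To illustrate the order of operations proposed in the comment above (filter first, then sort only the matches), here is a minimal sketch in plain Java. It is not MapR's or Hadoop's implementation; the class and method names are hypothetical, and it uses a simple prefix test in place of full glob matching.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Illustration only: filter an unsorted name list by prefix, then sort the matches. */
final class FilterThenSort {

    /**
     * Returns the names that start with the given prefix, sorted alphabetically.
     * Only the matching names are sorted, not the whole input list.
     */
    static List<String> matchPrefix(List<String> unsortedNames, String prefix) {
        List<String> matches = new ArrayList<>();
        for (String name : unsortedNames) {
            if (name.startsWith(prefix)) {
                matches.add(name);
            }
        }
        // With 6 matches out of 5300 names, this sort touches only 6 entries.
        Collections.sort(matches);
        return matches;
    }
}
```

The filtering pass still reads every name once, but the sort then runs on a handful of entries instead of the full directory.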

@Chris, how does the following perform for you?

% hadoop fs -ls /offlined/5_10016/2012_04 | grep /offlined/5_10016/2012_04/3998447_41956861_

I suspect it will be equally bad due to the sorting, but would be interested in hearing how it worked.

Perhaps an unsorted version like

% hadoop fs -lsN /offlined/5_10016/2012_04 | grep /offlined/5_10016/2012_04/3998447_41956861_

will run much faster. Is that something you could consider?

answered 27 Apr '12, 09:11

MC Srivas ♦♦

The first is as slow, if not slower. The second doesn't run: 'N' isn't a valid option on the MapR version we are running (v. 1.2.3.12961.GA).

I'm making a code change to see if using the NFS mount speeds things up.

(27 Apr '12, 10:21) ChrisCurtin

As a workaround, if I use Java File operations to get directory listings against the NFS mount (using a prefix Filter to cover some of the glob operations), the performance is on par with the Cloudera solution.
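For readers who want to try a similar workaround, a minimal sketch follows. It is not the code from this post, just one way to do a prefix-filtered listing with java.io.File against a locally mounted directory. The class name, method name, and command-line arguments are assumptions for illustration; the directory passed in would be wherever the cluster path is visible under the NFS mount.

```java
import java.io.File;
import java.util.Arrays;

/** Illustration only: prefix-filtered listing of a locally mounted (e.g. NFS) directory. */
final class NfsPrefixListing {

    /**
     * Returns the names in dir that start with prefix, sorted alphabetically.
     * Returns an empty array if dir does not exist or cannot be listed.
     */
    static String[] listByPrefix(File dir, String prefix) {
        // File.list(FilenameFilter) applies the filter on the client side.
        String[] names = dir.list((parent, name) -> name.startsWith(prefix));
        if (names == null) {
            return new String[0];
        }
        Arrays.sort(names);
        return names;
    }

    public static void main(String[] args) {
        // args[0]: directory under the NFS mount, args[1]: filename prefix
        for (String name : listByPrefix(new File(args[0]), args[1])) {
            System.out.println(name);
        }
    }
}
```

A prefix filter only covers globs of the form `prefix*`; patterns with wildcards elsewhere would need a different filter.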

This workaround is MapR-specific, so I'd still like to have this fixed in the globStatus() call itself, but I can continue my evaluation now.

Thanks,

Chris

answered 27 Apr '12, 10:56

ChrisCurtin

Chris, I would like you to try a quick patch to see if it fixes the problem. I need to send you a new libMapRClient.so.1 (which resides in /opt/mapr/lib/), so what is your email address? I presume you are running 1.2.3, correct?

answered 28 Apr '12, 18:30

MC Srivas ♦♦
