|
Hi, we're comparing an application currently running on Cloudera with MapR. Calls to the globStatus function on a FileSystem object are taking 5+ seconds to return on MapR, while on Cloudera they are sub-second. Everything else seems to work okay. Example of the path being globbed: "/offlined/5_10016/2012_04/3998447_41956861_*" The directory has 5300 files in it and the glob will match 6. Using 'hadoop fs -ls ...' it takes 6-7 seconds to get a listing, both with the * and for just the directory. Any idea what is wrong? The main reason we are considering MapR is that we hit NameNode limits on Cloudera with 11MM+ files across 9000+ directories. Thanks, Chris |
|
@Chris, how does the following perform for you?

% hadoop fs -ls /offlined/5_10016/2012_04 | grep /offlined/5_10016/2012_04/3998447_41956861_

I suspect it will be equally bad due to the sorting, but would be interested in hearing how it worked. Perhaps an unsorted version like

% hadoop fs -lsN /offlined/5_10016/2012_04 | grep /offlined/5_10016/2012_04/3998447_41956861_

will run much faster. Is that something you could consider?

The first is as slow, if not slower. The second doesn't run: 'N' isn't a valid option on the MapR version we are running (v. 1.2.3.12961.GA). I'm making a code change to see if using the NFS mount speeds things up.
(27 Apr '12, 10:21)
ChrisCurtin
|
|
As a workaround, if I use Java File operations to get directory listings against the NFS mount (using a prefix filter to cover some of the glob operations), the performance is on par with the Cloudera solution. This is MapR-specific, so I'd like to have this fixed via the globStatus() call, but I can continue my evaluation now. Thanks, Chris |
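The workaround described above might look roughly like the following. This is only a sketch, not Chris's actual code: the class name `NfsPrefixListing` and the method `listByPrefix` are made up for illustration, and it assumes the cluster is reachable through an NFS mount as an ordinary local directory. It uses only `java.io.File`, so nothing Hadoop- or MapR-specific is compiled in.

```java
import java.io.File;
import java.io.IOException;
import java.util.Arrays;

public class NfsPrefixListing {

    // Returns the names of entries in `dir` that start with `prefix`, sorted.
    // A prefix test only covers globs of the form "<prefix>*".
    static String[] listByPrefix(File dir, String prefix) throws IOException {
        // File.list(FilenameFilter) returns null if `dir` is not a readable directory.
        String[] names = dir.list((parent, name) -> name.startsWith(prefix));
        if (names == null) {
            throw new IOException("Cannot list directory: " + dir);
        }
        // Only the (small) matching set is sorted, not the whole directory.
        Arrays.sort(names);
        return names;
    }

    public static void main(String[] args) throws IOException {
        // Assumption: args[0] is a directory under the NFS mount, args[1] the prefix.
        File dir = new File(args.length > 0 ? args[0] : ".");
        String prefix = args.length > 1 ? args[1] : "";
        for (String name : listByPrefix(dir, prefix)) {
            System.out.println(new File(dir, name).getPath());
        }
    }
}
```

Globs with wildcards in the middle of the name would need a real pattern matcher rather than `startsWith`.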
|
Chris, would like you to try a quick patch to see if it fixes the problem. I need to send you a new libMapRClient.so.1 (which resides in /opt/mapr/lib/), so what is your email address? I presume you are running 1.2.3, correct? |
Going directly to the /mapr mount point and using Linux 'ls' the performance is much better: 2s for the full directory, 0.16 s for the partial match (3998447_41956861_*).
We have not recompiled against any MapR-specific libraries (nor do we include the Cloudera ones in the JAR). Do we need to?
Firstly, you shouldn't compile against any MapR-specific libraries.
Secondly, w.r.t. the speed problem, MapR keeps the filenames on disk in no particular order, while CDH holds them in memory in a Java map (i.e., sorted). Hence the 10x performance difference on very large directories when sorting them alphabetically. We should try to fix it by sending the filter "3998447_41956861_*" over to the file-server and returning a much smaller set for the Hadoop shell to sort. Will open a bug for it.
Over NFS, Linux's "ls" will fetch the entire directory and sort it itself (MapR never sees the globbing), so that situation is not possible to fix.
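To make the proposed fix concrete, here is a small stand-alone sketch of the "filter first, then sort" idea. It is not MapR's implementation: the class `FilterBeforeSort`, the method `filterThenSort`, and the filler file names are all hypothetical, and the glob matching uses the JDK's own `PathMatcher` rather than Hadoop's. The point is only that when names arrive in arbitrary order, applying the glob before sorting means sorting 6 entries instead of 5300.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class FilterBeforeSort {

    // Applies the glob to an unordered list of names, then sorts only the matches.
    static List<String> filterThenSort(List<String> unorderedNames, String glob) {
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + glob);
        List<String> matches = new ArrayList<>();
        for (String name : unorderedNames) {
            if (matcher.matches(Paths.get(name))) {
                matches.add(name);
            }
        }
        Collections.sort(matches);
        return matches;
    }

    public static void main(String[] args) {
        // Hypothetical directory contents: 5294 non-matching names plus 6 matching ones.
        List<String> names = new ArrayList<>();
        for (int i = 0; i < 5294; i++) {
            names.add("1000000_2000000_" + i);
        }
        for (int i = 0; i < 6; i++) {
            names.add("3998447_41956861_" + i);
        }
        Collections.shuffle(names); // simulate unordered on-disk storage

        for (String name : filterThenSort(names, "3998447_41956861_*")) {
            System.out.println(name);
        }
    }
}
```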