|
I have a single-node Dell R720XD host with 12 3TB SAS 7.2k disks off of a PERC 710 w/ 1 GB write cache, running 1.2.3.12961.GA under Ubuntu 12.04 beta 2 amd64 and the latest Sun JDK. All disks are configured as individual RAID0 volumes (13 total virtual disks, including the RAID1 OS volume, which is on 2 dedicated drives outside of the storage pool). CPU is 2 six-core Xeon E5-2620. 16GB RAM. When using a non-MapR software RAID solution like mdraid 0 or zfs, I'm able to perform sustained sequential writes and reads to the combined volume at a rate of between 1.1GB/sec and 1.5GB/sec. I get similar performance when accessing said software RAID volume via NFS over localhost. When accessing the same volume via MapR NFS (configured to access the raw /dev/sd_ devices individually), performance drops to about 200MB/sec, and parallel access to the nfsserver process reduces performance linearly as processes increase, to the point where 10 concurrent writes of a 1 gig file cached entirely in RAM result in 20MB/sec of throughput. Disabling compression has no effect. I've got 2 10GbE adapters in this host (and in the other hosts that I was hoping to put in an eventual M5 cluster), and was rather hoping to see performance closer to 600-800MB/sec for large sequential reads and writes via NFS. What can I do to help things along? Thanks in advance. |
|
I am unable to figure out from your description what your test is doing.
(a) Please describe your test, or better yet, post your test program in pastebin (place a link to it from here), and add a description of how you are running the test.
(b) Is this on one machine, with no network? If not, how many machines, and what is the replication factor?
(c) What is the switch between them? Have you verified that the switch works OK? If so, how did you do it?
(d) Are the NICs bonded? If not, are you letting MapR figure it out? If the boxes have a 100 MB/s NIC for admin, how are you ensuring that the 100MB/s NIC isn't getting used by MapR? (See the env var MAPR_SUBNETS in the docs at http://mapr.com/doc)

Fair enough :)
a) Test is fairly simple:
1) Create a random datafile via scrub -S 5G - > /tmp/randfile
2) time cat /tmp/randfile > /dev/null twice, ensuring that the entire source file has entered local fs cache.
3) mount 127.0.0.1:/mapr /mapr
4) time cat /tmp/randfile > /mapr/my.cluster.com/volume/randfiledest, or use rsync alternatively
b) Yes, single host. No network in between yet, just nfs via lo0 interface.
c) n/a
d) n/a, but the admin traffic will eventually travel on a 1GbE NIC via VLAN (as opposed to the dual 10GbE NICs, which will be dedicated to mapr)
Thanks again.
(27 Apr '12, 21:14)
peppert
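For anyone repeating this, the four steps listed above could be wrapped in a small script along these lines. It is only a sketch: the function name nfs_write_test is made up for illustration, the timing relies on GNU date and stat, and it assumes the source file already exists and the MapR NFS export is already mounted.

```shell
# Hypothetical helper (not part of MapR): time one cached-source copy onto a target path.
nfs_write_test() {
    local src="$1" dest="$2" start end bytes
    # Read the source twice so it is served from the local page cache.
    cat "$src" > /dev/null
    cat "$src" > /dev/null
    start=$(date +%s.%N)
    cat "$src" > "$dest"
    end=$(date +%s.%N)
    bytes=$(stat -c %s "$src")
    # Report in dd-style units (1 MB = 10^6 bytes).
    awk -v b="$bytes" -v s="$start" -v e="$end" \
        'BEGIN { d = e - s; if (d <= 0) d = 0.000001; printf "%.1f MB/s\n", b / d / 1e6 }'
}

# Usage, with the paths from the thread:
#   mount 127.0.0.1:/mapr /mapr
#   nfs_write_test /tmp/randfile /mapr/my.cluster.com/volume/randfiledest
```

Note that the timer stops when cat returns, so data still sitting in the client's write-back cache is not accounted for; a dd with oflag=sync is a stricter measurement.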
|
|
In addition to Srivas's comments, the use of a volume manager to carve 12 disks into 13 virtual disks may cause some confusion about write scheduling. A simpler arrangement where the majority of the virtual disks represent exactly a single physical drive could help a lot. Carving the RAID-1 OS partition out like that may require some slightly fancy footwork to avoid effectively losing a bit of space on the other drives.

One suggestion is to turn the first two drives into four virtual drives, two for OS and two for storage. Then you can put the two larger partitions on these disks in a single disk group and arrange the remaining 11 drives into congenial groupings. If you are only using a relatively small boot space (100G or so), then you might be willing to let the 3% lossage go unnoticed and simply let the system configure your disk groups for you. The lossage occurs because block devices in a disk group should be the same size.

I would also worry a bit if your logical volume manager is striping writes, as seems to be implied by your independent transfer rate test. That could mean that each of your virtual drives is actually resident on all of your drives, and the resulting write patterns are likely to be very confusing to the MapR software.

The other (slight) oddity in your configuration is the fact that you have CPU-heavy nodes without much memory. The normal recommendation is 4-6GB per core. You have 12 cores on each node, so 48-72GB would be normal. Without enough memory, you may wind up a bit starved in terms of map-reduce slots, depending on how much memory you give each slot and how much overcommit you allow.

One further test you can do with your system is to verify each virtual volume (in your configuration or the single disk per volume configuration) using dd. You should be able to transfer data at pretty much exactly the disk max I/O rate, less whatever overhead your controller imposes. This should be 100-150MB/s. If you get more than that, then your controller is getting fancy under the covers in an undesirable way. If you get much less than that, then there is a separate problem. You should also be able to do dd's to each separate block device and get independent transfers at the same maximum rate, with an aggregate at about the same level as you saw with your RAID in place.

Sorry, Ted, I wasn't entirely clear. There are actually 14 physical disks in the machine. 2 300GB 10k RPM SAS drives are set up as a RAID1 set and contain the OS. The other 12 disks are set up as individual RAID0 volumes, effectively making them JBOD, and appear to Linux as individual /dev/sd devices. I fed the raw devices to disksetup -F. The test I had performed pre-MapR with > 1GB/sec performance was simply creating a software RAID0 volume of the individual drives. Both Linux's native md and zfs-linux (kernel) provided similar performance characteristics. Individual drive I/O is about 150MB/sec sustained. Aggregate performance is in excess of 1GB/sec. Both individual disk and combined performance were tested via dd if=/dev/zero and bonnie++ with 1MB blocksize and at least 32GB of actual data being moved.
(27 Apr '12, 21:23)
peppert
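The per-volume dd check suggested above can be scripted. The sketch below only reads from the device, so it is non-destructive (a dd write to a raw device wipes whatever is on it). The function name read_rate is invented for illustration, and the device names in the usage comment are taken from the iostat output later in this thread.

```shell
# Hypothetical helper: sequentially read up to MB megabytes from a device or file
# and print the observed rate in MB/s (1 MB = 10^6 bytes, as dd reports it).
read_rate() {
    local dev="$1" mb="${2:-1024}" start end bytes
    start=$(date +%s.%N)
    # Count the bytes that were actually read, in case the source is shorter than requested.
    bytes=$(dd if="$dev" bs=1M count="$mb" 2>/dev/null | wc -c)
    end=$(date +%s.%N)
    awk -v b="$bytes" -v s="$start" -v e="$end" \
        'BEGIN { d = e - s; if (d <= 0) d = 0.000001; printf "%.1f\n", b / d / 1e6 }'
}

# Usage (as root, one device at a time):
#   for d in /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg; do
#       echo "$d: $(read_rate "$d" 4096) MB/s"
#   done
```

Repeated runs against the same region may be served from the page cache and report inflated numbers; on a real block device, adding iflag=direct to the dd call is one way to avoid that.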
Thanks for the insight on memory sizing. One of the reasons we're starting with a single node in testing is to discover an appropriate amount of RAM to allocate and how to allocate it. Considering we're only utilizing the MFS/NFS portions of mapr, and using it solely as a distributed scalable filesystem, perhaps it makes sense to disable half of the cores in place or reallocate the default memory distribution toward our specific needs? I'm getting the impression that you guys feel it's odd to receive performance this slow, even on a single node, and I'm encouraged by that fact :)
(27 Apr '12, 21:23)
peppert
Yeah... the memory is fine for testing the I/O speed, just not so good for map-reduce testing. And yes, this performance is absurdly low. Your 1+GB/s numbers are much more in line with the norm here.
(27 Apr '12, 22:34)
TedDunning ♦♦
|
|
Let's do a couple of very simple tests first, before doing all kinds of striping via LVM.
a) Use 6 raw drives only (no LVM or anything in between the disks and MapR). You will probably have to reformat the cluster if this is the only node ... make sure you completely clear out the Zookeeper data in /opt/mapr/zkdata when you do so.
b) Turn off MapR's compression, since you want to test random data. To turn it off via NFS, "vi /mapr/my.cluster.com/volume/.dfs_attributes" and set compression=false, and save the file (see http://mapr.com/doc/display/MapR/dfs-attributes+dir for more info). When randfiledest gets created, it will inherit the "compression=false" property from its parent dir.
Run the test, and let's make sure you see about 300 MB/s consistently. We can work our way to higher numbers after that.

http://pastebin.com/ZtLB9wu9 has everything I just did per your suggestion. I left out one fact: I set both replication and minimum replication to 1 on the /content volume. Still only seeing ~150MB/sec, roughly the performance of a single disk. Slightly faster because of the FBWC. For example, a raw disk dd right afterwards:
root@hdfs001:/mapr/my.cluster.com/content# dd if=/dev/zero of=/dev/sdi bs=1M count=5k
5120+0 records in
5120+0 records out
5368709120 bytes (5.4 GB) copied, 32.3842 s, 166 MB/s
And just to be clear, I can visually see that only one drive is being written to when I did the above dd. MapR should be able to write to the six disks as an aggregate faster than just a single disk, right?
(27 Apr '12, 22:17)
peppert
top right after the test:
top - 22:18:34 up 30 min, 1 user, load average: 0.03, 0.16, 0.17
Tasks: 204 total, 1 running, 203 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.0%us, 0.0%sy, 0.0%ni, 99.9%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 16386612k total, 12321144k used, 4065468k free, 10812k buffers
Swap: 16729084k total, 3448k used, 16725636k free, 3135052k cached
(27 Apr '12, 22:18)
peppert
|
|
Thx. Can you write a second file with "oflag=sync" added to the dd command line? I think the NFS client might be choking. Also, I am surprised only 1 drive is being used ... the disksetup command should've created two raid-0 groups of 3 drives each, and alternate between each raid-0 group every 256M. So it should be using all 6, but only 3 at a time ... which is why I mentioned the 300MB/s number. Can you paste the output of "/opt/mapr/server/mrconfig sp info"?

I think he meant that only one drive was being hit by the dd. This is a response to my comments earlier about possible LVM weirdness.
(27 Apr '12, 22:36)
TedDunning ♦♦
root@hdfs001:/mapr/my.cluster.com/content# dd if=/dev/zero of=/dev/sdi bs=1M count=5k oflag=sync
5120+0 records in
5120+0 records out
5368709120 bytes (5.4 GB) copied, 31.9359 s, 168 MB/s
Visually, the MapR FS behaves as you suggest: I can see 3 drives being written at a time, sequencing through all 6 in the pool. The single drive I mentioned was for the single raw dd to /dev/sdi that I did after the NFS test.
There was no info cmd; I think you want list?
root@hdfs001:/mapr/my.cluster.com/content# /opt/mapr/server/mrconfig sp list
ListSPs resp: status 0:2
No. of SPs (2), totalsize 16468961 MB, totalfree 16462899 MB
SP 0: name SP1, Online, size 8232432 MB, free 8229401 MB, path /dev/sdb
SP 1: name SP4, Online, size 8236528 MB, free 8233498 MB, path /dev/sde
(27 Apr '12, 22:38)
peppert
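The numbers being traded here are easy to sanity-check with a little arithmetic. The sketch below is illustrative only (both function names are invented): it converts a dd byte count and elapsed time into dd's MB/s, and multiplies a per-drive rate by the number of drives written at once to get a rough ceiling for one 3-drive group, ignoring controller and filesystem overhead.

```shell
# Convert a dd result (bytes, seconds) into MB/s using dd's unit (1 MB = 10^6 bytes).
mb_per_sec() {
    awk -v b="$1" -v s="$2" 'BEGIN { printf "%.1f\n", b / s / 1e6 }'
}

# Rough ceiling for one stripe group: drives written at once times per-drive MB/s.
stripe_estimate() {
    awk -v n="$1" -v r="$2" 'BEGIN { printf "%.0f\n", n * r }'
}

# mb_per_sec 5368709120 31.9359    # the raw /dev/sdi run above: prints 168.1
# stripe_estimate 3 100            # low end of the 100-150MB/s per-drive range: prints 300
# stripe_estimate 3 150            # high end: prints 450
```

With three drives active at a time, anything in the 300-450 MB/s range would be consistent with the per-drive figures quoted earlier in the thread.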
|
|
I meant using oflag=sync in this command that writes a file to MapR (that you showed on pastebin), and not on /dev/sdi
Can you paste /opt/mapr/logs/mfs.log into pastebin? I will take a look at what's happening. It's quite strange; you should see much better perf ... about 400 MB/s or so on those 3 drives. Please paste the entire mfs.log, as well as /opt/mapr/logs/disksetup.log, into pastebin and give me a ptr. Thanks!

Hey guys, sorry, fell asleep :) Probably a good thing, since I missed the of= target in that dd. If you can email me a pubkey, I'm happy to let you on the box to romp around.
root@hdfs001:/mapr/my.cluster.com/content# dd if=/dev/zero of=/mapr/my.cluster.com/content/zerofile bs=1M count=5k oflag=sync
5120+0 records in
5120+0 records out
5368709120 bytes (5.4 GB) copied, 35.4793 s, 151 MB/s
pastebin: http://pastebin.com/64vMxEdE
(28 Apr '12, 07:29)
peppert
|
|
Thanks for pasting, but it looks like the logs rolled over. What I wanted to see was how mfs was started. Show the output of "ps -ef | grep mfs", and of the first 500 lines of the original mfs.log.
Another interesting test to try would be to write several streams to MapR simultaneously, in the same dir. E.g.,
for i in $(seq 0 10); do dd if=/dev/zero of=/mapr/my.cluster.com/volume/file.$i bs=1M count=5k oflag=sync & done
You can look at the aggregate using "iostat" and let me know what you see. You should start seeing about 600MB/s or so, given that you have 6 drives.

http://pastebin.com/GZA1HmLZ for yesterday's mfs log.
All 10 writes came out like:
5368709120 bytes (5.4 GB) copied, 232.635 s, 23.1 MB/s
So, only 230MB/s aggregate. Still well under the single copy performance of a direct hadoop fs put. Also, iostat agreed:
Device:     tps   kB_read/s   kB_wrtn/s   kB_read   kB_wrtn
sda       18.36      217.96       11.18      1092        56
sdd      166.07        0.00    38146.11         0    191112
sdb      164.67        0.00    37775.65         0    189256
sdc      166.27        0.00    37425.95         0    187504
sde      179.24        0.00    41333.33         0    207080
sdf      178.84        0.00    41227.94         0    206552
sdg      177.84        0.00    41063.47         0    205728
(28 Apr '12, 09:26)
peppert
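A slightly more careful version of the parallel dd loop suggested above waits for every writer to finish and reports one aggregate figure. This is a sketch with an invented function name, parallel_write; the directory, stream count and size are parameters rather than anything MapR-specific. Note that the one-liner as posted, with seq 0 10, starts eleven writers rather than ten.

```shell
# Hypothetical helper: start N synchronous dd writers in DIR, wait for all of them,
# then print the combined rate in MB/s (1 MB = 10^6 bytes).
parallel_write() {
    local dir="$1" streams="$2" mb="$3" i start end
    start=$(date +%s.%N)
    for i in $(seq 1 "$streams"); do
        dd if=/dev/zero of="$dir/file.$i" bs=1M count="$mb" oflag=sync 2>/dev/null &
    done
    wait    # block until every background dd has exited
    end=$(date +%s.%N)
    awk -v n="$streams" -v mb="$mb" -v s="$start" -v e="$end" \
        'BEGIN { d = e - s; if (d <= 0) d = 0.000001;
                 printf "%.1f MB/s aggregate\n", n * mb * 1048576 / d / 1e6 }'
}

# Usage, matching the test in the thread (10 writers of 5 GiB each):
#   parallel_write /mapr/my.cluster.com/volume 10 5120
```

Timing the whole batch avoids having to add up the per-process dd summaries by hand, though it will read a little low if the writers finish at very different times.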
Compared against hadoop fs -put of the same 5GB random data file:
Device:     tps   kB_read/s   kB_wrtn/s   kB_read   kB_wrtn
sda        0.00        0.00        0.00         0         0
sdd      612.00        0.00   141112.00         0    141112
sdb      609.00        0.00   139880.00         0    139880
sdc      612.00        0.00   140200.00         0    140200
sde      631.00        0.00   145448.00         0    145448
sdf      634.00        0.00   146104.00         0    146104
sdg      634.00        0.00   146896.00         0    146896
So, clearly, mapr can write to the disks at the interface rate. nfs seems to be the limiting factor here.
(28 Apr '12, 09:37)
peppert
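To compare the two iostat samples without adding columns by hand, the write column can be summed with awk. A minimal sketch, assuming the six-column iostat device layout shown in this thread (kB_wrtn/s in the fourth field) and assuming iostat's kB means 1024 bytes; the function name sum_write_rate is made up.

```shell
# Hypothetical helper: sum kB_wrtn/s (4th field) over the devices whose name matches
# the given regex, reading iostat device lines on stdin, and print the total in MiB/s.
sum_write_rate() {
    awk -v pat="$1" '$1 ~ pat { total += $4 } END { printf "%.0f\n", total / 1024 }'
}

# Usage: save a sample to a file, or pipe a live one:
#   iostat -k 5 2 | sum_write_rate '^sd[b-g]$'
```

Fed the six MapR data disks from the NFS sample above it gives about 231, and from the hadoop fs -put sample about 839, which matches the gap described in the thread.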
/opt/mapr/server/mfs -b -f /ramfs/mapr/cachefile -p 5660 -n inode:6:log:6:meta:10:dir:40:small:15 -m 8192 -O /opt/mapr/conf/mapr-clusters.conf -i 16384
(28 Apr '12, 14:17)
peppert
Thanks! Can you edit /opt/mapr/conf/mfs.conf and change the following line
mfs.cache.lru.sizes=inode:6:log:6:meta:10:dir:40:small:15
to
mfs.cache.lru.sizes=inode:6:log:6:meta:6:dir:10:small:25
The setting you have is for a map-reduce env, not for an NFS env. I will try to reproduce your NFS perf problem back here at MapR.
(28 Apr '12, 14:58)
MC Srivas ♦♦
Don't forget to restart MFS after this change.
(28 Apr '12, 16:20)
TedDunning ♦♦
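If you would rather script the mfs.conf change than edit it by hand, something like the following would do it. This is a sketch using GNU sed, with an invented function name, set_mfs_lru_sizes; it keeps a backup copy next to the original and deliberately leaves the MFS restart to you.

```shell
# Hypothetical helper: replace the mfs.cache.lru.sizes line in CONF with VALUE,
# keeping the previous file as CONF.bak, then print the resulting line.
set_mfs_lru_sizes() {
    local conf="$1" value="$2"
    cp -p "$conf" "$conf.bak" || return 1
    sed -i "s|^mfs\.cache\.lru\.sizes=.*|mfs.cache.lru.sizes=$value|" "$conf" || return 1
    grep '^mfs\.cache\.lru\.sizes=' "$conf"
}

# Usage (as root), with the value suggested above:
#   set_mfs_lru_sizes /opt/mapr/conf/mfs.conf inode:6:log:6:meta:6:dir:10:small:25
```

If the file has no mfs.cache.lru.sizes line at all, grep prints nothing and the function returns non-zero, which is a cue to add the line manually.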
Thanks guys - somehow answers de-subscribed me to this question so I missed this until just now. Will try these changes immediately.
(02 May '12, 09:57)
peppert
|