I have a single-node Dell R720XD host with 12 3TB 7.2k SAS disks behind a PERC H710 with 1GB of write cache, running MapR 18.104.22.16861.GA under Ubuntu 12.04 beta 2 amd64 with the latest Sun JDK. All 12 data disks are configured as individual RAID0 virtual disks (13 virtual disks in total, including the RAID1 OS volume, which sits on 2 dedicated drives outside the storage pool). CPUs are two six-core Xeon E5-2620s, and the host has 16GB of RAM.
When using a non-MapR software RAID solution such as mdraid RAID0 or ZFS, I can sustain sequential writes and reads to the combined volume at between 1.1GB/sec and 1.5GB/sec. I get similar performance when accessing that software RAID volume over NFS via localhost.
When accessing the same disks via MapR NFS (configured to use the raw /dev/sd* devices individually), performance drops to about 200MB/sec, and parallel access to the nfsserver process degrades performance linearly as the number of processes increases, to the point where 10 concurrent writes of a 1GB file cached entirely in RAM yield 20MB/sec of aggregate throughput. Disabling compression has no effect.
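For reference, the tests were along these lines (a rough sketch; paths are illustrative and srcfile is a ~1GB file already warmed into the page cache):

dd if=/dev/zero of=/mnt/raid/seqfile bs=1M count=10k                                                    # sequential write to the software RAID volume
for i in $(seq 1 10); do dd if=/root/srcfile of=/mapr/my.cluster.com/test/file.$i bs=1M & done; wait    # 10 concurrent writers via MapR NFS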
I've got 2 10GbE adapters in this host (and the other hosts that I was hoping to put in an eventual M5 cluster), and was rather hoping to see performance closer to 600-800MB/sec for large sequential reads and writes via NFS. What can I do to help things along?
Thanks in advance.
asked 27 Apr '12, 19:02
I am unable to figure out from your description what your test is doing.
(a) please describe your test, or better yet, post your test program in pastebin (place a link to it from here), and drop a description about how you are running the test
(b) is this on one machine, with no network? if not, how many machines, what is the replication factor?
(c) what is the switch between them ... have you verified that the switch works OK? if so, how did you do it?
(d) Are the NICs bonded? If not, are you letting MapR figure it out? If the boxes have a 100Mb/s NIC for admin, how are you ensuring that the 100Mb/s NIC isn't getting used by MapR? (See the env var MAPR_SUBNETS in the docs at http://mapr.com/doc.)
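For example, MAPR_SUBNETS can be set in /opt/mapr/conf/env.sh on each node so MapR traffic stays on the 10GbE interfaces (the subnet below is only illustrative); restart the MapR services after changing it:

export MAPR_SUBNETS=10.10.0.0/24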
answered 27 Apr '12, 20:21
MC Srivas ♦♦
In addition to Srivas' comments, using a volume manager to carve 12 disks into 13 virtual disks may cause some confusion about write scheduling. A simpler arrangement, where the majority of the virtual disks each represent exactly one physical drive, could help a lot. Carving the RAID-1 OS partition out like that may require some slightly fancy footwork to avoid effectively losing a bit of space on the other drives.
One suggestion is to turn the first two drives into four virtual drives, two for OS and two for storage. Then you can put the two larger partitions on these disks in a single disk group and arrange the remaining 11 drives into congenial groupings. If you are only using a relatively small boot space (100G or so), then you might be willing to let the 3% lossage go unnoticed and simply let the system configure your disk groups for you. The lossage occurs because block devices in a disk group should be the same size.
I would also worry a bit if your logical volume manager is striping writes as seems to be implied by your independent transfer rate test. That could mean that each of your virtual drives is actually resident on all of your drives and the resulting write patterns are likely to be very confusing to the MapR software.
The other (slight) oddity in your configuration is the fact that you have CPU-heavy nodes without much memory. The normal recommendation is 4-6GB per core. You have 12 cores on each node so 48-72GB would be normal. Without enough memory, you may wind up a bit starved in terms of map-reduce slots depending on how much memory you give each slot and how much overcommit you allow.
One further test you can do with your system is to verify each virtual volume (in your configuration or the single disk per volume configuration) using dd. You should be able to transfer data at pretty much exactly the disk max I/O rate less whatever overhead your controller imposes. This should be 100-150MB/s. If you get more than that, then your controller is getting fancy under the covers in an undesirable way. If you get much less than that, then there is a separate problem. You should also be able to do dd's to each separate block device and get independent transfers at the same maximum rate with an aggregate at about the same level as you saw with your RAID in place.
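A rough sketch of that check (device names are illustrative; substitute your actual data drives):

for d in /dev/sd{c..n}; do dd if=$d of=/dev/null bs=1M count=2k iflag=direct; done        # one drive at a time, expect roughly 100-150MB/s each
for d in /dev/sd{c..n}; do dd if=$d of=/dev/null bs=1M count=2k iflag=direct & done; wait # all drives at once, watch the aggregate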
answered 27 Apr '12, 21:01
Let's do a couple of very simple tests first, before doing all kinds of striping via LVM.
a) use 6 raw drives only (no LVM or anything else between the disks and MapR). You will probably have to reformat the cluster if this is the only node ... make sure you completely clear out the ZooKeeper data in /opt/mapr/zkdata when you do so.
b) turn off MapR's compression, since you want to test random data. To turn it off via NFS, edit the .dfs_attributes file in the directory where randfiledest will be created, set compression=false, and save the file (see http://mapr.com/doc/display/MapR/dfs-attributes+dir for more info). When randfiledest gets created, it will inherit the "compression=false" property from its parent dir.
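A rough sketch of steps (a) and (b), assuming a single-node cluster; device names, paths, and service names are illustrative and should be checked against your install:

service mapr-warden stop; service mapr-zookeeper stop
rm -rf /opt/mapr/zkdata/*
ls /dev/sd[c-h] > /tmp/disks.txt                        # the 6 raw data drives, one per line
/opt/mapr/server/disksetup -F /tmp/disks.txt
service mapr-zookeeper start; service mapr-warden start
vi /mapr/my.cluster.com/volume/.dfs_attributes          # set compression=false in the dir that will hold randfiledest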
Run the test, and let's make sure you see about 300MB/s consistently. We can work our way up to higher numbers after that.
answered 27 Apr '12, 21:39
MC Srivas ♦♦
Thx. Can you write a second file with "oflag=sync" added to the dd command line? I think the NFS client might be choking.
Also, I am surprised only 1 drive is being used ... the disksetup command should've created two RAID-0 groups of 3 drives each and alternated between them every 256MB. So it should be using all 6, but only 3 at a time ... which is why I mentioned the 300MB/s number. Can you paste the output of "/opt/mapr/server/mrconfig sp info"?
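Something along these lines, against the same directory as before (path and file name are illustrative):

dd if=/dev/zero of=/mapr/my.cluster.com/volume/randfiledest2 bs=1M count=5k oflag=sync
/opt/mapr/server/mrconfig sp info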
answered 27 Apr '12, 22:31
MC Srivas ♦♦
I meant using oflag=sync in the command that writes a file to MapR (the one you showed on pastebin), not on /dev/sdi.
Can you paste /opt/mapr/logs/mfs.log into pastebin? I will take a look at what's happening. It's quite strange; you should see much better perf ... about 400MB/s or so on those 3 drives. Please paste the entire mfs.log, as well as /opt/mapr/logs/disksetup.log, into pastebin and give me a ptr. Thanks!
answered 27 Apr '12, 22:47
MC Srivas ♦♦
Thanks for pasting, but it looks like the logs rolled over. What I wanted to see was how mfs was started. Show the output of "ps -ef | grep mfs", and the first 500 lines of the original mfs.log.
Another interesting test to try would be to write several streams to MapR simultaneously, in the same dir. E.g.:
for i in $(seq 0 10); do dd if=/dev/zero of=/mapr/my.cluster.com/volume/file.$i bs=1M count=5k oflag=sync & done
You can look at the aggregate using "iostat" and let me know what you see. You should start seeing about 600MB/s or so, given that you have 6 drives.
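For example, while the loop above runs (iostat is from the sysstat package):

iostat -xm 5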
answered 28 Apr '12, 08:10
MC Srivas ♦♦