|
I have a lab cluster with 6 nodes, each with three 50 GB "disks" (VMware) allocated as raw storage, so 150 GB per datanode and 900 GB total. Replication factor is set to 3. I've noticed that attempting to cp or rsync files onto the NFS gateway fails if the write goes to any disk that has less free space than the size of the file I'm copying. Am I correct in assuming that MapR will not span multiple disks in a large file write? Am I therefore constrained to file sizes smaller than the smallest free space available on any disk in the pool at the time of the write? How would I work around this problem? Thanks! |
|
You should not have any limitations of this kind in a normally operating system. Data within a storage pool is striped across the disks in the pool. Moreover, files are normally split into chunks, each of which is stored on a different storage pool. If the chunk size is larger than the available space, then you might have a problem. Can you say more about how much space you have available? (Thanks, btw, for giving good information about the rest of your system configuration.) |
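To make the chunking point concrete, here is a toy model in Python. It is only a sketch, not MapR's actual allocator: the function name `place_chunks`, the greedy "most free space first" rule, and the sizes used are all assumptions chosen for illustration. It shows that a file larger than the free space of any single storage pool can still be written as long as each individual chunk fits somewhere.

```python
def place_chunks(file_size, chunk_size, pool_free):
    """Toy placement: put each chunk on the pool with the most free space.

    Sizes are in MB. Returns the list of pool indexes used, one per chunk,
    or None if some chunk does not fit on any pool.
    This is a hypothetical model, not MapR's real placement logic.
    """
    free = list(pool_free)  # copy so the caller's list is not modified
    placement = []
    remaining = file_size
    while remaining > 0:
        size = min(chunk_size, remaining)
        # Pick the pool with the most free space (first one wins on ties).
        idx = max(range(len(free)), key=lambda i: free[i])
        if free[idx] < size:
            return None  # even the emptiest pool cannot hold this chunk
        free[idx] -= size
        placement.append(idx)
        remaining -= size
    return placement


# A 1000 MB file fits on two pools with 600 MB free each,
# even though no single pool could hold the whole file.
print(place_chunks(1000, 256, [600, 600]))

# A chunk larger than the free space of every pool cannot be placed.
print(place_chunks(300, 256, [200, 200]))
```

In this model the limit is the chunk size versus per-pool free space, not the file size versus per-disk free space.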
|
When writing with 3x replication (well, even with 1x replication), the first copy is always created on the local machine. So if the NFS server is running on a "datanode", it will create the first copy locally, i.e., on the same machine as itself. Any imbalance in space utilization will eventually be smoothed out. Note, however, that the balancers are turned off by default. See http://mapr.com/doc/display/MapR/Balancers for documentation about how the balancers work and how to turn them on. |
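As a rough back-of-the-envelope illustration of the "first copy is local" effect, the sketch below estimates how much data lands on the writing node versus each of the other nodes. It assumes, purely for illustration, that the remaining replicas are spread uniformly over the other nodes and that the balancers are off; the function name `expected_copies` is made up for this example.

```python
from fractions import Fraction


def expected_copies(num_nodes, replication):
    """Expected copies stored per unit of data written through one node.

    Assumes the first replica always lands on the writing node and the
    other (replication - 1) replicas are spread uniformly over the
    remaining nodes. This is a simplified model, not MapR's real policy.
    """
    if num_nodes < 2 or not 1 <= replication <= num_nodes:
        raise ValueError("need num_nodes >= 2 and 1 <= replication <= num_nodes")
    writer = Fraction(1)  # the local first copy
    other = Fraction(replication - 1, num_nodes - 1)
    return writer, other


writer, other = expected_copies(6, 3)
print("writer node:", writer, "each other node:", other, "ratio:", writer / other)
```

Under these assumptions, a 6-node cluster with 3x replication stores one full copy on the writing node and 2/5 of a copy on each other node, so the node running the NFS gateway fills noticeably faster until a balancer redistributes the data.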
|
Thanks for your answer. It inspired me to de-install, rm -rf /opt, and re-install the cluster from scratch today and repeat the test. It went much better this time: striping performed as advertised, unlike yesterday, though I'm noticing the local disks on the NFS node are filling at a far faster rate than those on the other members of the cluster, nearly 2:1. I suspect it takes some time to backfill the new data off to other datanodes in the cluster, and eventually they will all reach parity? Is there a best practice to minimize this effect? Should I even run a datanode on the NFS node? When I first created the cluster, I assigned only a single 50GB volume to the NFS node, and subsequently added an additional 100MB disk after it was already running. I wonder if that somehow exposed the behavior I was seeing yesterday, where only a single volume was filling until it was full (although the UI reported all 3 disks were available, and eventual consistency would spread the data across the disks more or less evenly). Does Hadoop/MapR expect disks used for data storage to be of identical size across each host? Throughout the cluster? Thanks much. |