I have a lab cluster with 6 nodes, each with three 50 GB "disks" (VMware virtual disks) allocated as raw storage, so 150 GB per datanode and 900 GB total. The replication factor is set to 3.
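For reference, the capacity arithmetic above can be sketched as follows. This is only a back-of-the-envelope calculation: the function name is made up for illustration, and it assumes usable space is simply raw space divided by the replication factor, ignoring filesystem overhead and any reserved space.

```python
def cluster_capacity_gb(nodes, disks_per_node, disk_gb, replication):
    """Return (raw_gb, usable_gb) for a uniform cluster.

    Assumption: usable space is raw space divided by the replication
    factor; metadata and reserved space are ignored.
    """
    raw_gb = nodes * disks_per_node * disk_gb
    usable_gb = raw_gb / replication
    return raw_gb, usable_gb


raw_gb, usable_gb = cluster_capacity_gb(nodes=6, disks_per_node=3,
                                        disk_gb=50, replication=3)
print(raw_gb, usable_gb)  # 900 300.0
```

So with 3x replication, roughly 300 GB of distinct data would fit in the 900 GB of raw storage.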

I've noticed that a cp or rsync of files onto the NFS gateway fails if the write goes to any disk that has less free space than the size of the file being copied.

Am I correct in assuming that MapR will not span multiple disks when writing a large file? Am I therefore constrained to file sizes smaller than the smallest free space available on any disk in the pool at the time of the write?

How would I work around this problem?

Thanks!

asked 11 Feb, 16:22

peppert


You should not have any limitations of this kind in a normally operating system.

Data within a storage pool is striped across the disks in the pool. Moreover, files are normally split into chunks, each of which is stored on a different storage pool.

If the chunk size is larger than the available space, then you might have a problem. Can you say more about how much space you have available? (Thanks, by the way, for giving good information about the rest of your system configuration.)
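To illustrate why chunking removes the "file must fit on one disk" limit, here is a toy sketch. It is not how MapR actually places data: the 256 MB chunk size is an assumed value, and `chunk_sizes` and `place_chunks` are hypothetical helpers that simply put each chunk on whichever storage pool currently has the most free space.

```python
CHUNK_MB = 256  # assumed chunk size, for illustration only


def chunk_sizes(file_mb, chunk_mb=CHUNK_MB):
    """Split a file of file_mb megabytes into chunk sizes (last may be short)."""
    full, rest = divmod(file_mb, chunk_mb)
    return [chunk_mb] * full + ([rest] if rest else [])


def place_chunks(file_mb, pool_free_mb, chunk_mb=CHUNK_MB):
    """Greedy toy placement: each chunk goes to the pool with most free space.

    Returns the pool index chosen for every chunk. Raises RuntimeError if
    some chunk is larger than the free space of every pool.
    """
    free = list(pool_free_mb)
    placement = []
    for size in chunk_sizes(file_mb, chunk_mb):
        idx = max(range(len(free)), key=free.__getitem__)
        if free[idx] < size:
            raise RuntimeError("no storage pool has room for a chunk")
        free[idx] -= size
        placement.append(idx)
    return placement


# A 1000 MB file fits even though no single pool has 1000 MB free.
print(place_chunks(1000, [600, 600]))  # [0, 1, 0, 1]
```

In this model a write only fails when a single chunk cannot fit anywhere, which matches the caveat above about the chunk size being larger than the available space.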

answered 11 Feb, 19:18

TedDunning ♦♦

When writing with 3x replication (or even with 1x replication), the first copy is always created on the local machine. So if the NFS server is running on a "datanode", it will create the first copy locally, i.e., on the same machine as itself. Any imbalance in space utilization will eventually be smoothed out by the balancers. Note that by default the balancers are turned off. See http://mapr.com/doc/display/MapR/Balancers for documentation about how the balancers work and how to turn them on.
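A rough sketch of what local-first placement implies for disk usage, before any balancing happens. This is a simplified model, not MapR's actual placement logic: `expected_fill` is a hypothetical helper, and it assumes the remaining replicas are spread uniformly over the other nodes.

```python
def expected_fill(nodes, replication, writer=0):
    """Expected data stored per node for each unit of data written.

    Assumptions: the first replica always lands on the writer node, and
    the other (replication - 1) replicas are spread uniformly across the
    remaining nodes. Balancers are assumed to be off.
    """
    share = (replication - 1) / (nodes - 1)
    return [1.0 if n == writer else share for n in range(nodes)]


fill = expected_fill(nodes=6, replication=3)
print(fill)               # per-node share; node 0 is the NFS gateway node
print(fill[0] / fill[1])  # how much faster the gateway node fills
```

Under these assumptions the node hosting the NFS gateway receives a full copy of everything written through it, while each other node receives only a fraction, so the gateway node fills noticeably faster until the balancers redistribute data.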

answered 12 Feb, 11:03

MC Srivas ♦♦

Thanks for your answer. It inspired me to uninstall, rm -rf /opt, and reinstall the cluster from scratch today and repeat the test.

It went much better this time: striping performed as advertised, unlike yesterday. However, I'm noticing that the local disks on the NFS node are filling at a far faster rate than those on the other members of the cluster, nearly 2:1. I suspect it takes some time to backfill the new data off to the other datanodes in the cluster, and that eventually they will all reach parity? Is there a best practice to minimize this effect? Should I even run a datanode on the NFS node?

When I first created the cluster, I assigned only a single 50GB volume to the NFS node, and subsequently added an additional 100MB disk after it was already running. I wonder if that somehow exposed the behavior I was seeing yesterday, where only a single volume was filling until it was full (although the UI reported that all 3 disks were available, and eventual consistency should have spread the data across the disks more or less evenly). Does Hadoop/MapR expect the disks used for data storage to be of identical size on each host? Throughout the cluster?

Thanks much.

answered 12 Feb, 10:51

peppert

Asked: 11 Feb, 16:22

Seen: 194 times

Last updated: 12 Feb, 11:03
