|
I have been running a 6-node cluster for a week, and the data does not seem to be well balanced across the disks in the cluster. Take today's data, for instance:
The data seems to favor the master node, but why? What is the data-balancing strategy, and how can I balance data in my cluster? |
|
The MapR system balances automatically by moving data off nodes that are fuller than the cluster average. A node that is within +/-10% of the cluster average is considered to be of average fullness. Additionally, by default the balancer doesn't move data from a node unless it is at least 70% full. This behavior can be changed by modifying the config variable "cldb.balancer.disk.threshold.percentage".

In the instance that you pasted, all of the nodes have disk utilization within 10% of the average, and all of them are below 70%, so the balancer would take no action.

By master, I'm assuming you mean the CLDB node. MapR tries hard to keep the first copy of every write local. If this causes an anomaly where some nodes have excessive space utilization compared to the rest of the cluster (more than +/-10% from the cluster average), the balancer will fix it. Please look at http://www.mapr.com/doc/display/MapR/Balancers for additional details regarding the balancer.

Yes, "master" is the hostname of the node in my cluster that the CLDB is running on. Great balancing strategy! I ran lots of tests on the CLDB node, including sending test data using 'hadoop fs -put ...'. I checked the size of the test data and found that the extra usage on the CLDB node matches the "keep the first copy of every write local" strategy. So there is no problem at all. Great system! Thank you very much!
(22 Sep '11, 19:47)
zoglee
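
The rule described in the answer above can be sketched roughly as follows. This is only an illustration of the two conditions mentioned (the 70% threshold and the +/-10% band), not MapR's actual balancer code. The function name, the use of a simple mean of per-node percentages, the reading of "10%" as percentage points, and the utilization figures are all assumptions made for the example.

```python
from statistics import mean


def nodes_to_drain(disk_used_pct, threshold_pct=70.0, band_pct=10.0):
    """Return nodes a balancer following the described rule would move data from.

    disk_used_pct maps a node name to its disk utilization in percent.
    threshold_pct mirrors the default described for
    cldb.balancer.disk.threshold.percentage; band_pct is the +/-10% band.
    Assumption: the band is measured in percentage points around a simple mean.
    """
    cluster_avg = mean(disk_used_pct.values())
    return sorted(
        node
        for node, used in disk_used_pct.items()
        # Both conditions must hold: the node is at or above the threshold
        # and it is fuller than the cluster average by more than the band.
        if used >= threshold_pct and used > cluster_avg + band_pct
    )


# Hypothetical utilization figures, not taken from this thread.
balanced = {"master": 48.0, "node2": 41.0, "node3": 39.0,
            "node4": 40.0, "node5": 42.0, "node6": 40.0}
skewed = dict(balanced, master=85.0)

print(nodes_to_drain(balanced))  # []
print(nodes_to_drain(skewed))    # ['master']
```

In the first case every node is below 70% and close to the average, so nothing is selected, which matches the "balancer would take no action" conclusion above. In the second case the hypothetical master node crosses both limits and would be a candidate for draining.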
|
|
hadoop@lord:~$ /opt/mapr/bin/maprcli config load -json | grep "cldb.balancer"
I have 18 system volumes and 11 user-data volumes in my cluster. The system volumes hold very little data; most of the data is in the user-data volumes:
|