|
We have two NICs on each MapR node, and MapR provides application-level NIC bonding (this is great!). However, when the interface bound to the hostname fails, access from other nodes seems to be lost, and the node shows as 'critical' on the Control Center dashboard even though the other interface is still alive. Does this mean that MapR provides load balancing but not NIC fault tolerance? Likewise, if each server is connected to multiple networks but all the interfaces bound to the nodes' hostnames are connected to a single switch, could that switch be a single point of failure? |
|
No, this means that your cluster isn't fully configured to take advantage of NIC redundancy. I will let somebody else drill down into exactly what is happening, but the basic idea is that the multiple interfaces have to be discoverable. Regarding your second point about plugging all of the network interfaces into a single switch: yes, that creates a single hardware point of failure. That is why we recommend two top-of-rack switches when you use NIC bonding. |
|
I have three nodes: node1, node2, and node3. Each node has two NICs: eth0 and eth1. /etc/hosts is as follows:
ifconfig -a on node1 (similar on node2 and node3):
The cluster is configured by /opt/mapr/server/configure.sh -C node1,node2,node3 -Z node1,node2,node3. When I disabled eth0 on node1 (the NIC bound to the IP address 192.168.1.1) by running 'ifconfig eth0 down', node1 entered the Critical state after 5 minutes and all the services on node1 appeared to be disabled, even though eth1 was still up. On node2, 'hadoop fs -ls /' worked, but the following warning appeared:
In addition, when running a MapReduce job, there was a pause of a few minutes at the beginning of the job. Any thoughts? |
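One way to see which addresses each hostname is bound to is to list the matching entries in /etc/hosts on every node. The following is only a minimal sketch: the helper name ips_for_host is made up for illustration, and it assumes the standard hosts file layout (an IP address followed by one or more names).

```shell
#!/bin/sh
# Hypothetical helper: print every IP address that a hosts-format file
# maps to the given name (one address per line).
ips_for_host() {
    hosts_file=$1
    name=$2
    awk -v name="$name" '
        {
            sub(/#.*/, "")                  # drop comments
            for (i = 2; i <= NF; i++)       # field 1 is the IP, the rest are names
                if ($i == name) print $1
        }
    ' "$hosts_file"
}

# node1, node2 and node3 are the hostnames used in this thread.
for h in node1 node2 node3; do
    echo "$h: $(ips_for_host /etc/hosts "$h" | tr '\n' ' ')"
done
```

If a hostname maps only to the address on eth0, other nodes that look it up by name have no route to the address on eth1.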
|
The configure.sh script writes the names you give it to the mapr-clusters.conf file. If you want to use multiple NICs on that node, run configure.sh with both of the node's names, for example /opt/mapr/server/configure.sh -C node1,node1-2,node2,node3 -Z node1,node1-2,node2,node3, and so on for the other nodes. |
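As a rough sketch of the suggestion above, the host list can be assembled once and reused for both options. The join_hosts helper is hypothetical, node1-2 is the second-interface name used in the answer, and the script only prints the command for review rather than running it.

```shell
#!/bin/sh
# Hypothetical helper: join its arguments with commas,
# e.g. "node1 node1-2" -> "node1,node1-2". Runs in a subshell so the
# IFS change does not leak into the caller.
join_hosts() (
    IFS=,
    printf '%s\n' "$*"
)

# Every name that should be listed, including the second-NIC name.
# Adjust this list to match the names defined in /etc/hosts.
HOSTS=$(join_hosts node1 node1-2 node2 node3)

# Print the command instead of executing it.
echo "/opt/mapr/server/configure.sh -C $HOSTS -Z $HOSTS"
```

Printing first makes it easy to confirm that the -C and -Z lists contain exactly the intended names before reconfiguring the cluster.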
|
First, I tried the following configuration:
But this didn't work because ZooKeeper failed to start. Then I tried the following cases: (1)
(2)
(3)
In all three cases, ZooKeeper started successfully and the cluster became active. When I disabled eth0 on node1, the CLDB service on node1 shut down in every case. After eth0 was disabled, node1 remained in the Healthy state; I guess this is because FileServer was still able to communicate through eth1. In case (3), however, all services on node1 became 'not configured'. What is the best configuration for this 3-node cluster? |