We have two NICs on each MapR node, and MapR provides application-level NIC bonding (this is great!). However, when the interface bound to the node's hostname fails, other nodes seem to lose access to it, and the node is shown as 'critical' on the Control Center dashboard even though the other interface is still up. Does this mean that MapR provides load balancing but not NIC fault tolerance?

Likewise, if each server is connected to multiple networks but all the interfaces bound to the nodes' hostnames are connected to a single switch, is that switch a single point of failure?

asked 18 Apr '12, 05:43

nagix


No, this means that your cluster isn't fully configured to take advantage of NIC redundancy. I will let somebody else drill down into exactly what is happening, but the basic idea is that the multiple interfaces have to be discoverable.

Regarding your second point about connecting all of the network interfaces to a single switch: yes, that creates a single hardware point of failure. That is why we recommend two top-of-rack switches when you use NIC bonding.


answered 18 Apr '12, 11:13

TedDunning

I have three nodes: node1, node2 and node3. Each node has two NICs: eth0 and eth1.

/etc/hosts is as follows:

192.168.1.1   node1
192.168.1.2   node2
192.168.1.3   node3
192.168.2.1   node1-2
192.168.2.2   node2-2
192.168.2.3   node3-2
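The naming scheme above (one primary name per node plus a `-2` name on the second subnet) can be sanity-checked before the cluster is configured. The following is only a minimal sketch, not MapR tooling: `check_dual_homed` is a hypothetical helper that reads hosts-format lines on standard input and reports any node whose `-2` entry is missing or sits on the same /24 as its primary address. It assumes the `-2` suffix and the 255.255.255.0 masks used in this post.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of MapR): reads /etc/hosts-style lines on stdin
# and checks that every primary name has a "<name>-2" entry on a different /24.
check_dual_homed() {
  awk '
    NF == 0 || $1 ~ /^#/ { next }            # skip blank lines and comments
    $1 ~ /^127\./ || $1 ~ /:/ { next }       # skip loopback and IPv6 entries
    { ip[$2] = $1 }                          # remember address by first hostname
    END {
      bad = 0
      for (name in ip) {
        if (name ~ /-2$/) continue           # only iterate over primary names
        alt = name "-2"
        if (!(alt in ip)) { print name ": no " alt " entry"; bad = 1; continue }
        p = ip[name]; sub(/\.[0-9]+$/, "", p)    # drop last octet: /24 prefix
        s = ip[alt];  sub(/\.[0-9]+$/, "", s)
        if (p == s) { print name ": both addresses are on " p ".0/24"; bad = 1 }
      }
      exit bad
    }'
}

# Usage (feed it only the cluster entries, other hosts would be flagged too):
#   grep node /etc/hosts | check_dual_homed && echo "all nodes are dual-homed"
```

The function exits non-zero if any node fails the check, so it can be used as a guard in a setup script.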

ifconfig -a on node1 (similar on node2 and node3):

eth0      Link encap:Ethernet  HWaddr 84:2B:2B:4E:87:72  
          inet addr:192.168.1.1  Bcast:192.168.1.255  Mask:255.255.255.0
          inet6 addr: fe80::862b:2bff:fe4e:8772/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:52018 errors:0 dropped:0 overruns:0 frame:0
          TX packets:27 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:5245690 (5.0 MiB)  TX bytes:3963 (3.8 KiB)

eth1      Link encap:Ethernet  HWaddr 00:05:33:26:91:6B  
          inet addr:192.168.2.1  Bcast:192.168.2.255  Mask:255.255.255.0
          inet6 addr: fe80::205:33ff:fe26:916b/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:93482153 errors:0 dropped:0 overruns:0 frame:0
          TX packets:14623336 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:123595030756 (115.1 GiB)  TX bytes:197675978064 (184.1 GiB)

lo        Link encap:Local Loopback
...

The cluster was configured with /opt/mapr/server/configure.sh -C node1,node2,node3 -Z node1,node2,node3

When I disabled eth0 on node1 (the NIC bound to 192.168.1.1) by executing 'ifconfig eth0 down', node1 entered the Critical state after 5 minutes and all the services on node1 appeared to be disabled, even though eth1 was still up. On node2, 'hadoop fs -ls /' still worked, but the following error message appeared:

[root] # hadoop fs -ls /
2012-04-24 03:37:56,5735 ERROR Cidcache fs/client/fileclient/cc/cidcache.cc:1046 Thread: 1098754368 Lookup of volume mapr.cluster.root failed, error Connection reset by peer(104), CLDB: 192.168.1.1:7222 trying another CLDB
Found 3 items
drwxrwxrwx   - root root          1 2012-04-10 23:42 /user
drwxrwxrwx   - root root         41 2012-04-19 23:54 /hbase
drwxrwxrwx   - root root          1 2011-09-15 14:50 /var

In addition, when I ran a MapReduce job, there was a pause of a few minutes at the beginning of the job.

Any thoughts?


answered 24 Apr '12, 04:03

nagix

The configure.sh script writes the names you give it to the mapr-clusters.conf file. If you want to use multiple NICs on a node, run configure.sh with both of that node's names:

/opt/mapr/server/configure.sh -C node1,node1-2,node2,node3 -Z node1,node1-2,node2,node3

and so on for the other nodes.
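To avoid typos when listing several names per node, the comma-separated arguments can be generated rather than typed. A minimal sketch under stated assumptions: `join_by_comma`, `with_secondary` and `print_configure_cmd` are hypothetical helpers, the `-2` suffix follows the /etc/hosts naming used earlier in this thread, and the command is only printed for review, never executed. The CLDB and ZooKeeper lists are passed separately because this thread does not settle whether the secondary names belong in the -Z list.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; they only print a command line, they do not run it.

# join_by_comma A B C -> A,B,C
join_by_comma() {
  local IFS=,
  echo "$*"
}

# with_secondary NODE... -> each node followed by its "-2" name, one per line
with_secondary() {
  local n
  for n in "$@"; do
    printf '%s\n%s-2\n' "$n" "$n"
  done
}

# print_configure_cmd "CLDB hosts" "ZooKeeper hosts" (whitespace-separated lists)
print_configure_cmd() {
  local cldb zk
  cldb=$(join_by_comma $1)   # unquoted on purpose: split the list on whitespace
  zk=$(join_by_comma $2)
  echo "/opt/mapr/server/configure.sh -C $cldb -Z $zk"
}

# Usage:
#   print_configure_cmd "$(with_secondary node1 node2 node3)" "node1 node2 node3"
```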


answered 24 Apr '12, 04:24

Nabeel

First, I tried the following configuration:

[root] # /opt/mapr/server/configure.sh -C node1,node2,node3,node1-2,node2-2,node3-2 -Z node1,node2,node3,node1-2,node2-2,node3-2

But this didn't work because ZooKeeper failed to start. I then tried the following cases:

(1)

[root] # /opt/mapr/server/configure.sh -C node1,node2,node3,node1-2,node2-2,node3-2 -Z node1,node2,node3,node1-2,node2-2

(2)

[root] # /opt/mapr/server/configure.sh -C node1,node2,node3,node1-2,node2-2,node3-2 -Z node1,node2,node3,node1-2

(3)

[root] # /opt/mapr/server/configure.sh -C node1,node2,node3,node1-2,node2-2,node3-2 -Z node1,node2,node3

In all three cases, ZooKeeper started successfully and the cluster became active. When I disabled eth0 on node1, the CLDB service on node1 shut down in every case, but node1 remained in the Healthy state. I guess this is because FileServer was still able to communicate through eth1.

However, in case (3), all services on node1 became 'not configured'.

What is the best configuration for this three-node cluster?


answered 25 Apr '12, 12:43

nagix


Last updated: 02 Nov '12, 05:05
