This is a roundabout way of asking "which storage is more expensive for Hadoop: SAN or distributed DAS?" If you can dial the filesystem replication factor down to 1 when using a SAN, then the storage cost equation favors SAN over the default 3x replication factor on distributed DAS.
Yes, there is a 3x cost to replication, but there is also a 3x performance increase, and that cost increase comes off of a baseline that is about 30x cheaper than the net cost of most SAN storage. The net is that distributed local storage wins most price/performance comparisons very easily because of the massive residual advantage.
Your mileage will vary and a detailed comparison is a very complex computation, but with storage-heavy nodes the numbers are hard to overcome.
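To make the residual advantage concrete, here is a back-of-envelope sketch in Python. The 30x cost ratio and the 3x bandwidth credit for replication are the figures from above; the normalized unit costs are placeholder assumptions, not quotes.

```python
# Back-of-envelope price/performance sketch (normalized, illustrative numbers;
# the cost ratio is the ~30x assumption from the text, not a vendor quote).

DAS_COST_PER_TB = 1.0   # baseline: commodity local disk, normalized to 1
SAN_COST_PER_TB = 30.0  # assumption: SAN storage ~30x the cost of local disk
DAS_REPLICATION = 3     # default Hadoop replication factor
SAN_REPLICATION = 1     # assume the array's internal redundancy allows 1x

# Effective cost per usable TB once replication is paid for.
das_cost = DAS_COST_PER_TB * DAS_REPLICATION  # 3.0
san_cost = SAN_COST_PER_TB * SAN_REPLICATION  # 30.0

# 3x replication also means ~3x the nodes that can serve any given block,
# so credit DAS with ~3x read bandwidth relative to a single SAN path.
das_perf, san_perf = 3.0, 1.0

das_price_perf = das_cost / das_perf
san_price_perf = san_cost / san_perf
print(f"cost per usable TB:  DAS={das_cost:.0f}  SAN={san_cost:.0f}")
print(f"price/performance advantage for DAS: {san_price_perf / das_price_perf:.0f}x")
```

With these inputs the advantage works out to roughly 30x, which is why even a large error in any single assumption doesn't change the conclusion.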
As an example, with conventional hardware, 12 drives can move about a GB per second using MapR. With sufficient network bandwidth, this gives you an aggregate bandwidth of about a TB/s for a 1000-node cluster. Smaller clusters also show very impressive performance. Even with stock Hadoop you can get about a third of this bandwidth, which isn't all that shabby.
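The aggregate figure is simple multiplication; a quick sketch, assuming the ~1 GB/s-per-node rate above holds and the network is not the bottleneck:

```python
# Aggregate scan bandwidth from the per-node figure (assumptions from the
# text: ~12 drives per node moving ~1 GB/s total under MapR).

GB_PER_NODE = 1.0  # GB/s per node
NODES = 1000

aggregate = GB_PER_NODE * NODES  # GB/s across the cluster
print(f"{NODES} nodes -> ~{aggregate / 1000:.1f} TB/s aggregate")

# Stock Hadoop at roughly a third of the per-node rate:
print(f"stock Hadoop -> ~{aggregate / 3 / 1000:.2f} TB/s aggregate")
```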
Your mileage will vary, of course, and if somebody is giving you a big SAN for free and paying the power costs, you probably should come to a different answer. Likewise, if you are using EC2, their SAN is your only real choice for persistent storage.
answered 28 Mar '12, 14:01
Within the context of this thread, the "what if a data node fails" question comes up. Here is my crack at an answer to that...
In a typical cluster using distributed DAS with an "N" replication factor, a failed data node takes out only one replica of each block it held; the remaining N-1 copies on other nodes keep the data readable, and the framework re-replicates back up to N on healthy nodes.
In an atypical cluster that is using shared storage, the failed data node's blocks are still physically intact on the array, but the cluster has lost its only access path to them, so the data is unavailable even though it was never actually lost.
If my assumptions are correct, then using shared storage for Hadoop doesn't win us anything unless the Hadoop distribution is smart enough to see a data node failure as not affecting the storage components locally assigned to it, and it must have the ability to reassign those local storage components to a healthy compute node.
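Here is a toy simulation of the two failure cases, under my assumptions above; the random block placement is a stand-in for illustration, not any distribution's actual placement policy.

```python
# Toy model: blocks mapped to the data nodes holding a replica; check what
# is still readable after a single node failure. Placement is random for
# illustration, not a real block-placement policy.
import random

def place_blocks(blocks, nodes, replication):
    """Assign each block replicas on `replication` distinct nodes."""
    return {b: random.sample(nodes, replication) for b in blocks}

def readable_after_failure(placement, failed):
    """Count blocks that still have at least one replica on a live node."""
    return sum(1 for replicas in placement.values()
               if any(n != failed for n in replicas))

random.seed(42)
nodes = [f"dn{i}" for i in range(10)]
blocks = [f"blk{i}" for i in range(1000)]

# Typical cluster: distributed DAS, N=3. Losing a node removes one replica
# of some blocks, but every block keeps N-1 live copies to read from and
# to re-replicate from.
das = place_blocks(blocks, nodes, replication=3)
print("DAS, N=3, dn0 fails:   ",
      readable_after_failure(das, "dn0"), "of", len(blocks), "readable")

# Atypical cluster: shared storage modeled as N=1. The bytes survive on the
# array, but blocks assigned to the failed node are unreadable unless that
# storage can be reassigned to a healthy compute node.
shared = place_blocks(blocks, nodes, replication=1)
print("Shared, N=1, dn0 fails:",
      readable_after_failure(shared, "dn0"), "of", len(blocks), "readable")
```

In the N=3 case everything stays readable; in the N=1 shared-storage case, roughly a tenth of the blocks go dark with the failed node, which is exactly the reassignment problem described above.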
answered 28 Mar '12, 14:58