We've been running some of our bigger jobs on our MapR cluster, and I've noticed that the "Running Map Tasks" count can be greater than the Map Task Capacity. How do we make the cluster enforce the capacity? Thanks in advance for any help that anyone can provide.
Interesting topic... I've noticed this as well. sxsnyc, in your real-world tests, have you settled on an optimal prefetch.maptasks value, or reached some other general conclusion on the matter? I assume the prefetch feature was introduced to improve map task performance, but I have not yet experimented with the value or benchmarked it.
We're still experiencing some resource consumption issues for jobs with long-running mappers. Does anyone have any ideas?
We've set mapreduce.tasktracker.prefetch.maptasks to .10 in mapred-site.xml on each TaskTracker. This reduced the total number of mappers allocated, but NFS and the JobTracker still become unresponsive.
Does it make sense for the JobTracker to allocate prefetched map tasks to a job that is already over the Map Task Capacity?
OK. Now we're setting mapred.running.map.limit to impose a cluster-wide limit on running map tasks for the job, but it does not seem to have any effect on the running job.
Prefetch slots don't actually launch more tasks than your cluster capacity allows. They just let a TaskTracker accept more tasks than its maximum number of map slots. This creates a pipeline and eliminates scheduling overhead. If you have many jobs running at the same time, you can always turn off prefetch slots by setting mapreduce.tasktracker.prefetch.maptasks=0 on each TaskTracker.
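For reference, turning prefetch off might look roughly like the fragment below in mapred-site.xml on each TaskTracker node. The property name and the value 0 come from this thread; the surrounding layout is the usual Hadoop configuration format, and I'm assuming the TaskTracker has to be restarted to pick up the change.

```xml
<!-- mapred-site.xml on each TaskTracker node (sketch, not a complete file) -->
<configuration>
  <property>
    <!-- 0 disables prefetch, so the TaskTracker should accept no map
         tasks beyond its configured map slots -->
    <name>mapreduce.tasktracker.prefetch.maptasks</name>
    <value>0</value>
  </property>
</configuration>
```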
mapred.running.map.limit was introduced in JIRA https://issues.apache.org/jira/browse/HADOOP-5170 but was later backed out. If you are using the fair scheduler, you can set limits on each pool instead. See http://hadoop.apache.org/common/docs/current/fair_scheduler.html
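As a rough sketch of a per-pool limit, a fair scheduler allocation file could cap concurrent map and reduce tasks for one pool as shown below. The pool name `long_mappers` and the numbers are made up for illustration, and this assumes your fair scheduler version supports the `maxMaps` and `maxReduces` elements, so check the linked documentation against the release you run.

```xml
<?xml version="1.0"?>
<!-- Fair scheduler allocation file (sketch). The pool name and the limits
     below are hypothetical placeholders, not recommended values. -->
<allocations>
  <pool name="long_mappers">
    <!-- upper bound on map tasks running concurrently in this pool -->
    <maxMaps>40</maxMaps>
    <!-- upper bound on reduce tasks running concurrently in this pool -->
    <maxReduces>10</maxReduces>
  </pool>
</allocations>
```

Jobs with long-running mappers would then need to be submitted to that pool for the cap to apply; how a job is mapped to a pool depends on how the scheduler is configured on your cluster.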