I'm having issues with getting the JobTracker to stay alive.

While trying to get Mahout working, I start off by running:

env JAVA_HOME=$JAVA_HOME HADOOP_CONF_DIR=$HADOOP_CONF_DIR ./build-20news-bayes.sh

However, when I start a JobTracker from the MapR Control System, it starts running and fs.JobTrackerWatcher is able to find it, but the client never manages to connect. At about the time the client starts trying to connect, the JobTracker enters a failed state and stops. I'm not sure where to start troubleshooting this. Port 9001 is open within the EC2 Security Group, so it should be able to connect, right?

Some logs:

12/04/09 20:15:30 INFO fs.JobTrackerWatcher: Current running JobTracker is: ip-10-116-223-132.ec2.internal/10.116.223.132:9001
12/04/09 20:15:31 INFO ipc.Client: Retrying connect to server: ip-10-116-223-132.ec2.internal/10.116.223.132:9001. Already tried 0 time(s).
12/04/09 20:15:32 INFO ipc.Client: Retrying connect to server: ip-10-116-223-132.ec2.internal/10.116.223.132:9001. Already tried 1 time(s).
12/04/09 20:15:33 INFO ipc.Client: Retrying connect to server: ip-10-116-223-132.ec2.internal/10.116.223.132:9001. Already tried 2 time(s).
12/04/09 20:15:34 INFO ipc.Client: Retrying connect to server: ip-10-116-223-132.ec2.internal/10.116.223.132:9001. Already tried 3 time(s).
12/04/09 20:15:35 INFO ipc.Client: Retrying connect to server: ip-10-116-223-132.ec2.internal/10.116.223.132:9001. Already tried 4 time(s).
12/04/09 20:15:36 INFO ipc.Client: Retrying connect to server: ip-10-116-223-132.ec2.internal/10.116.223.132:9001. Already tried 5 time(s).
12/04/09 20:15:37 INFO ipc.Client: Retrying connect to server: ip-10-116-223-132.ec2.internal/10.116.223.132:9001. Already tried 6 time(s).
12/04/09 20:15:38 INFO ipc.Client: Retrying connect to server: ip-10-116-223-132.ec2.internal/10.116.223.132:9001. Already tried 7 time(s).
12/04/09 20:15:39 INFO ipc.Client: Retrying connect to server: ip-10-116-223-132.ec2.internal/10.116.223.132:9001. Already tried 8 time(s).
12/04/09 20:15:40 INFO ipc.Client: Retrying connect to server: ip-10-116-223-132.ec2.internal/10.116.223.132:9001. Already tried 9 time(s).
12/04/09 20:15:40 INFO ipc.RPC: FailoverProxy: Server on ip-10-116-223-132.ec2.internal/10.116.223.132:9001 is lost due to java.net.SocketException: Call to ip-10-116-223-132.ec2.internal/10.116.223.132:9001 failed on socket exception in call getStagingAreaDir
12/04/09 20:15:40 INFO ipc.RPC: Searching for the Active Server ...
12/04/09 20:15:40 INFO ipc.RPC: Attempt# 1 . Trying to connect Server at ip-10-116-223-132.ec2.internal/10.116.223.132:9001

asked 09 Apr '12, 13:21

tristanls

edited 09 Apr '12, 13:22


Do you run Mahout from within EC2 as well, like your cluster?

answered 09 Apr '12, 18:04

yufeldman ♦♦

Yes, everything is in the same security group on EC2. In the log above, both the JobTracker and Mahout are installed on 10.116.223.132.

(09 Apr '12, 18:31) tristanls

Is the JobTracker actually running? If so, could you telnet to both ip-10-116-223-132.ec2.internal and 10.116.223.132 on port 9001 from the machine where you are running Mahout? I know it sounds strange, but your errors are strange too.

(09 Apr '12, 22:47) yufeldman ♦♦
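
The check suggested in the comment above can be scripted when telnet is not installed. This is a minimal sketch, assuming bash (for its /dev/tcp redirection) and the coreutils `timeout` command; `check_port` is a hypothetical helper name, not something shipped with Hadoop or MapR.

```shell
#!/usr/bin/env bash
# check_port HOST PORT: hypothetical helper. Succeeds (exit 0) only if a TCP
# connection to HOST:PORT can be opened within 3 seconds.
check_port() {
  local host="$1" port="$2"
  # /dev/tcp/HOST/PORT is a bash feature, not a real file. The child bash
  # receives HOST as $0 and PORT as $1 and exits once the attempt finishes.
  timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}
```

Run from the machine that submits the Mahout job, `check_port ip-10-116-223-132.ec2.internal 9001 && echo reachable` and the same call with 10.116.223.132 cover both forms of the address. A failure for both usually points at the service not listening rather than at name resolution.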

OK, I found warden.log (I had been looking for jobtracker.log), and it clearly states why the JobTracker breaks:

2012-04-10 21:09:04,403 ERROR com.mapr.warden.service.baseservice.Service$ServiceRun run [jobtracker_monitor]: Error while running command: [nice, -n, -10, /opt/mapr/hadoop/hadoop-0.20.2/bin/hadoop-daemon.sh, start, jobtracker]
2012-04-10 21:09:04,403 ERROR com.mapr.warden.service.baseservice.Service$ServiceRun run [jobtracker_monitor]:     +======================================================================+
|      Error: JAVA_HOME is not set and Java could not be found         |
+----------------------------------------------------------------------+
| Please download the latest Sun JDK from the Sun Java web site        |
|       > http://java.sun.com/javase/downloads/ <                      |
|                                                                      |
| Hadoop requires Java 1.6 or later.                                   |
| NOTE: This script will find Sun Java whether you install using the   |
|       binary or the RPM based installer.                             |
+======================================================================+
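
To surface entries like the one above without reading the whole file, the ERROR lines can be filtered out of warden.log. A small sketch; `warden_errors` is a hypothetical helper, and the log location mentioned in the usage note is an assumption about a default MapR install.

```shell
#!/usr/bin/env bash
# warden_errors FILE [N]: hypothetical helper that prints the last N
# (default 20) lines of FILE containing " ERROR ", prefixed with their
# line numbers in the file.
warden_errors() {
  local file="$1" count="${2:-20}"
  grep -n ' ERROR ' "$file" | tail -n "$count"
}
```

For example, `warden_errors /opt/mapr/logs/warden.log` (path assumed). Multi-line messages such as the boxed JAVA_HOME notice continue on untagged lines, so open the file at the reported line number to read the rest.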

What I don't understand now is why JAVA_HOME is not set, because:

$ cat /etc/environment 
MAHOUT_HOME=/opt/mapr/mahout/mahout-0.5
HADOOP_HOME=/opt/mapr/hadoop/hadoop-0.20.2
HADOOP_CONF_DIR=/opt/mapr/hadoop/hadoop-0.20.2/conf
JAVA_HOME=/opt/java/64/jre1.6.0_31
PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games"

answered 10 Apr '12, 14:13

tristanls

OK, the "is the JobTracker running?" problem is solved. My mistake was starting mapr-warden before the variables were set in /etc/environment, so the warden was launched without JAVA_HOME. The JobTracker is no longer an issue.
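
One way to confirm this kind of problem is to look at the environment a running process actually has, since variables added to /etc/environment later do not reach a process that is already running. The sketch below assumes Linux, where /proc/<pid>/environ holds a process's start-up environment as NUL-separated NAME=value pairs; `environ_var` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# environ_var FILE NAME: hypothetical helper. Prints the value of variable
# NAME from a NUL-separated environ file (for example /proc/<pid>/environ).
# Prints nothing if the variable is not present.
environ_var() {
  local file="$1" name="$2"
  # Turn NUL separators into newlines, then print only the matching value.
  tr '\0' '\n' < "$file" | sed -n "s/^${name}=//p"
}
```

For example, as root, `environ_var /proc/<pid>/environ JAVA_HOME` with the pid of the warden process taken from `ps`. Empty output means the process was started without the variable and needs a restart to pick it up.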

What is still an issue is the mysterious "Retrying connect to server" log output. However, it is followed by a "normal" failure, so this might actually be working:

12/04/10 21:20:52 INFO bayes.TrainClassifier: Training Bayes Classifier
12/04/10 21:20:53 INFO bayes.BayesDriver: Reading features...
12/04/10 21:20:54 INFO fs.JobTrackerWatcher: Current running JobTracker is: ip-10-2-65-155.ec2.internal/10.2.65.155:9001
12/04/10 21:20:55 INFO ipc.Client: Retrying connect to server: ip-10-2-65-155.ec2.internal/10.2.65.155:9001. Already tried 0 time(s).
12/04/10 21:20:56 INFO ipc.Client: Retrying connect to server: ip-10-2-65-155.ec2.internal/10.2.65.155:9001. Already tried 1 time(s).
12/04/10 21:20:57 INFO ipc.Client: Retrying connect to server: ip-10-2-65-155.ec2.internal/10.2.65.155:9001. Already tried 2 time(s).
12/04/10 21:20:58 INFO ipc.Client: Retrying connect to server: ip-10-2-65-155.ec2.internal/10.2.65.155:9001. Already tried 3 time(s).
12/04/10 21:20:59 INFO ipc.Client: Retrying connect to server: ip-10-2-65-155.ec2.internal/10.2.65.155:9001. Already tried 4 time(s).
12/04/10 21:21:00 INFO ipc.Client: Retrying connect to server: ip-10-2-65-155.ec2.internal/10.2.65.155:9001. Already tried 5 time(s).
12/04/10 21:21:01 INFO ipc.Client: Retrying connect to server: ip-10-2-65-155.ec2.internal/10.2.65.155:9001. Already tried 6 time(s).
12/04/10 21:21:02 INFO ipc.Client: Retrying connect to server: ip-10-2-65-155.ec2.internal/10.2.65.155:9001. Already tried 7 time(s).
12/04/10 21:21:03 INFO ipc.Client: Retrying connect to server: ip-10-2-65-155.ec2.internal/10.2.65.155:9001. Already tried 8 time(s).
12/04/10 21:21:04 INFO ipc.Client: Retrying connect to server: ip-10-2-65-155.ec2.internal/10.2.65.155:9001. Already tried 9 time(s).
12/04/10 21:21:04 INFO ipc.RPC: FailoverProxy: Server on ip-10-2-65-155.ec2.internal/10.2.65.155:9001 is lost due to java.net.SocketException: Call to ip-10-2-65-155.ec2.internal/10.2.65.155:9001 failed on socket exception in call getStagingAreaDir
12/04/10 21:21:04 INFO ipc.RPC: Searching for the Active Server ...
12/04/10 21:21:04 INFO ipc.RPC: Attempt# 1 . Trying to connect Server at ip-10-2-65-155.ec2.internal/10.2.65.155:9001
12/04/10 21:21:05 INFO ipc.Client: Retrying connect to server: ip-10-2-65-155.ec2.internal/10.2.65.155:9001. Already tried 0 time(s).
12/04/10 21:21:06 INFO ipc.Client: Retrying connect to server: ip-10-2-65-155.ec2.internal/10.2.65.155:9001. Already tried 1 time(s).
12/04/10 21:21:07 INFO ipc.Client: Retrying connect to server: ip-10-2-65-155.ec2.internal/10.2.65.155:9001. Already tried 2 time(s).
12/04/10 21:21:08 INFO ipc.Client: Retrying connect to server: ip-10-2-65-155.ec2.internal/10.2.65.155:9001. Already tried 3 time(s).
12/04/10 21:21:09 INFO ipc.Client: Retrying connect to server: ip-10-2-65-155.ec2.internal/10.2.65.155:9001. Already tried 4 time(s).
12/04/10 21:21:10 INFO ipc.Client: Retrying connect to server: ip-10-2-65-155.ec2.internal/10.2.65.155:9001. Already tried 5 time(s).
12/04/10 21:21:11 INFO ipc.Client: Retrying connect to server: ip-10-2-65-155.ec2.internal/10.2.65.155:9001. Already tried 6 time(s).
12/04/10 21:21:12 INFO ipc.Client: Retrying connect to server: ip-10-2-65-155.ec2.internal/10.2.65.155:9001. Already tried 7 time(s).
12/04/10 21:21:13 INFO ipc.Client: Retrying connect to server: ip-10-2-65-155.ec2.internal/10.2.65.155:9001. Already tried 8 time(s).
12/04/10 21:21:14 INFO ipc.Client: Retrying connect to server: ip-10-2-65-155.ec2.internal/10.2.65.155:9001. Already tried 9 time(s).
12/04/10 21:21:14 WARN ipc.RPC: Error connecting server at ip-10-2-65-155.ec2.internal/10.2.65.155:9001 java.net.SocketException: Call to ip-10-2-65-155.ec2.internal/10.2.65.155:9001 failed on socket exception
12/04/10 21:21:14 INFO ipc.RPC: Tried all servers sleeping
12/04/10 21:21:16 INFO fs.JobTrackerWatcher: Current running JobTracker is: ip-10-2-65-155.ec2.internal/10.2.65.155:9001
12/04/10 21:21:16 INFO ipc.RPC: Attempt# 2 . Trying to connect Server at ip-10-2-65-155.ec2.internal/10.2.65.155:9001
12/04/10 21:21:17 INFO ipc.Client: Retrying connect to server: ip-10-2-65-155.ec2.internal/10.2.65.155:9001. Already tried 0 time(s).
12/04/10 21:21:18 INFO ipc.Client: Retrying connect to server: ip-10-2-65-155.ec2.internal/10.2.65.155:9001. Already tried 1 time(s).
12/04/10 21:21:19 INFO ipc.Client: Retrying connect to server: ip-10-2-65-155.ec2.internal/10.2.65.155:9001. Already tried 2 time(s).
12/04/10 21:21:20 INFO ipc.Client: Retrying connect to server: ip-10-2-65-155.ec2.internal/10.2.65.155:9001. Already tried 3 time(s).
12/04/10 21:21:21 INFO ipc.Client: Retrying connect to server: ip-10-2-65-155.ec2.internal/10.2.65.155:9001. Already tried 4 time(s).
12/04/10 21:21:22 INFO ipc.Client: Retrying connect to server: ip-10-2-65-155.ec2.internal/10.2.65.155:9001. Already tried 5 time(s).
12/04/10 21:21:23 INFO ipc.Client: Retrying connect to server: ip-10-2-65-155.ec2.internal/10.2.65.155:9001. Already tried 6 time(s).
12/04/10 21:21:24 INFO ipc.Client: Retrying connect to server: ip-10-2-65-155.ec2.internal/10.2.65.155:9001. Already tried 7 time(s).
12/04/10 21:21:25 INFO ipc.Client: Retrying connect to server: ip-10-2-65-155.ec2.internal/10.2.65.155:9001. Already tried 8 time(s).
12/04/10 21:21:26 INFO ipc.Client: Retrying connect to server: ip-10-2-65-155.ec2.internal/10.2.65.155:9001. Already tried 9 time(s).
12/04/10 21:21:26 WARN ipc.RPC: Error connecting server at ip-10-2-65-155.ec2.internal/10.2.65.155:9001 java.net.SocketException: Call to ip-10-2-65-155.ec2.internal/10.2.65.155:9001 failed on socket exception
12/04/10 21:21:26 INFO ipc.RPC: Tried all servers sleeping
12/04/10 21:21:30 INFO fs.JobTrackerWatcher: Current running JobTracker is: ip-10-2-65-155.ec2.internal/10.2.65.155:9001
12/04/10 21:21:30 INFO ipc.RPC: Attempt# 3 . Trying to connect Server at ip-10-2-65-155.ec2.internal/10.2.65.155:9001
12/04/10 21:21:31 INFO ipc.Client: Retrying connect to server: ip-10-2-65-155.ec2.internal/10.2.65.155:9001. Already tried 0 time(s).
12/04/10 21:21:32 INFO ipc.Client: Retrying connect to server: ip-10-2-65-155.ec2.internal/10.2.65.155:9001. Already tried 1 time(s).
12/04/10 21:21:33 INFO ipc.Client: Retrying connect to server: ip-10-2-65-155.ec2.internal/10.2.65.155:9001. Already tried 2 time(s).
12/04/10 21:21:34 INFO ipc.Client: Retrying connect to server: ip-10-2-65-155.ec2.internal/10.2.65.155:9001. Already tried 3 time(s).
12/04/10 21:21:35 INFO ipc.Client: Retrying connect to server: ip-10-2-65-155.ec2.internal/10.2.65.155:9001. Already tried 4 time(s).
12/04/10 21:21:36 INFO ipc.Client: Retrying connect to server: ip-10-2-65-155.ec2.internal/10.2.65.155:9001. Already tried 5 time(s).
12/04/10 21:21:37 INFO ipc.Client: Retrying connect to server: ip-10-2-65-155.ec2.internal/10.2.65.155:9001. Already tried 6 time(s).
12/04/10 21:21:38 INFO ipc.Client: Retrying connect to server: ip-10-2-65-155.ec2.internal/10.2.65.155:9001. Already tried 7 time(s).
12/04/10 21:21:39 INFO ipc.Client: Retrying connect to server: ip-10-2-65-155.ec2.internal/10.2.65.155:9001. Already tried 8 time(s).
12/04/10 21:21:40 INFO ipc.Client: Retrying connect to server: ip-10-2-65-155.ec2.internal/10.2.65.155:9001. Already tried 9 time(s).
12/04/10 21:21:40 WARN ipc.RPC: Error connecting server at ip-10-2-65-155.ec2.internal/10.2.65.155:9001 java.net.SocketException: Call to ip-10-2-65-155.ec2.internal/10.2.65.155:9001 failed on socket exception
12/04/10 21:21:40 INFO ipc.RPC: Tried all servers sleeping
12/04/10 21:21:46 INFO fs.JobTrackerWatcher: Current running JobTracker is: ip-10-2-65-155.ec2.internal/10.2.65.155:9001
12/04/10 21:21:46 INFO ipc.RPC: Attempt# 4 . Trying to connect Server at ip-10-2-65-155.ec2.internal/10.2.65.155:9001
12/04/10 21:21:47 INFO ipc.Client: Retrying connect to server: ip-10-2-65-155.ec2.internal/10.2.65.155:9001. Already tried 0 time(s).
12/04/10 21:21:48 INFO ipc.Client: Retrying connect to server: ip-10-2-65-155.ec2.internal/10.2.65.155:9001. Already tried 1 time(s).
12/04/10 21:21:49 INFO ipc.Client: Retrying connect to server: ip-10-2-65-155.ec2.internal/10.2.65.155:9001. Already tried 2 time(s).
12/04/10 21:21:50 INFO ipc.Client: Retrying connect to server: ip-10-2-65-155.ec2.internal/10.2.65.155:9001. Already tried 3 time(s).
12/04/10 21:21:51 INFO ipc.Client: Retrying connect to server: ip-10-2-65-155.ec2.internal/10.2.65.155:9001. Already tried 4 time(s).
12/04/10 21:21:52 INFO ipc.Client: Retrying connect to server: ip-10-2-65-155.ec2.internal/10.2.65.155:9001. Already tried 5 time(s).
12/04/10 21:21:53 INFO ipc.Client: Retrying connect to server: ip-10-2-65-155.ec2.internal/10.2.65.155:9001. Already tried 6 time(s).
12/04/10 21:21:54 INFO ipc.Client: Retrying connect to server: ip-10-2-65-155.ec2.internal/10.2.65.155:9001. Already tried 7 time(s).
12/04/10 21:21:55 INFO ipc.Client: Retrying connect to server: ip-10-2-65-155.ec2.internal/10.2.65.155:9001. Already tried 8 time(s).
12/04/10 21:21:56 INFO ipc.Client: Retrying connect to server: ip-10-2-65-155.ec2.internal/10.2.65.155:9001. Already tried 9 time(s).
12/04/10 21:21:56 WARN ipc.RPC: Error connecting server at ip-10-2-65-155.ec2.internal/10.2.65.155:9001 java.net.SocketException: Call to ip-10-2-65-155.ec2.internal/10.2.65.155:9001 failed on socket exception
12/04/10 21:21:56 INFO ipc.RPC: Tried all servers sleeping
12/04/10 21:22:04 INFO fs.JobTrackerWatcher: Current running JobTracker is: ip-10-2-65-155.ec2.internal/10.2.65.155:9001
12/04/10 21:22:04 INFO ipc.RPC: Attempt# 5 . Trying to connect Server at ip-10-2-65-155.ec2.internal/10.2.65.155:9001
12/04/10 21:22:04 INFO ipc.RPC: New Active server found on ip-10-2-65-155.ec2.internal/10.2.65.155:9001
12/04/10 21:22:04 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
12/04/10 21:22:06 INFO mapred.JobClient: Cleaning up the staging area maprfs://10.2.65.155:7222/var/mapr/cluster/mapred/jobTracker/staging/root/.staging/job_201204102116_0001
Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: /user/root/examples/bin/work/20news-bydate/bayes-train-input
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:225)
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:236)
    at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:1005)
    at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:997)
    at org.apache.hadoop.mapred.JobClient.access$500(JobClient.java:170)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:914)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:867)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Unknown Source)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1109)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:867)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:841)
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1276)
    at org.apache.mahout.classifier.bayes.mapreduce.common.BayesFeatureDriver.runJob(BayesFeatureDriver.java:63)
    at org.apache.mahout.classifier.bayes.mapreduce.bayes.BayesDriver.runJob(BayesDriver.java:47)
    at org.apache.mahout.classifier.bayes.TrainClassifier.trainNaiveBayes(TrainClassifier.java:54)
    at org.apache.mahout.classifier.bayes.TrainClassifier.main(TrainClassifier.java:162)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
    at java.lang.reflect.Method.invoke(Unknown Source)
    at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
    at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
    at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:187)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
    at java.lang.reflect.Method.invoke(Unknown Source)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:186)

My guess is that Hadoop expects the training data to be available on the JobTracker machine rather than on the machine the job was submitted from, i.e. the Mahout machine.

answered 10 Apr '12, 14:28

tristanls

My guess is that Hadoop is looking for the input in maprfs, and it is not there.

answered 10 Apr '12, 14:39

yufeldman ♦♦
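
To check this guess, the input path can be tested in the cluster file system and uploaded if it is missing. This is a sketch only: it relies on the standard `hadoop fs -test`, `-mkdir` and `-put` shell commands, `ensure_dfs_input` is a hypothetical helper, and the paths in the usage note are assumptions based on the error message.

```shell
#!/usr/bin/env bash
# ensure_dfs_input LOCAL_DIR DFS_DIR: hypothetical helper. If DFS_DIR does not
# exist in the cluster file system, make sure its parent exists and copy
# LOCAL_DIR there.
ensure_dfs_input() {
  local local_dir="$1" dfs_dir="$2" parent
  parent=$(dirname "$dfs_dir")
  if hadoop fs -test -e "$dfs_dir"; then
    echo "already present: $dfs_dir"
    return 0
  fi
  # Create the parent directory only when it is missing.
  hadoop fs -test -d "$parent" || hadoop fs -mkdir "$parent" || return 1
  # -put copies a local directory recursively into the cluster file system.
  hadoop fs -put "$local_dir" "$dfs_dir" || return 1
  echo "uploaded: $local_dir -> $dfs_dir"
}
```

For example, `ensure_dfs_input examples/bin/work/20news-bydate/bayes-train-input /user/root/examples/bin/work/20news-bydate/bayes-train-input`, assuming the prepared training data sits under examples/bin/work on the local disk of the machine running Mahout.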

It looks to me like it's using $HOME instead of $MAHOUT_HOME, because it's prepending /user/root to the path that it claims doesn't exist.

(10 Apr '12, 14:46) tristanls
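
A likely explanation for the /user/root prefix is Hadoop's handling of relative paths: by default, a path without a leading slash is resolved against the user's home directory in the cluster file system, /user/<username>, not against $HOME or $MAHOUT_HOME on the local disk. The sketch below only mimics that rule for illustration; `resolve_dfs_path` is a hypothetical helper and does not call Hadoop.

```shell
#!/usr/bin/env bash
# resolve_dfs_path USER DFS_PATH: hypothetical helper that mimics the default
# rule: absolute paths are kept as they are, relative ones are placed under
# /user/USER.
resolve_dfs_path() {
  local user="$1" dfs_path="$2"
  case "$dfs_path" in
    /*) printf '%s\n' "$dfs_path" ;;
    *)  printf '%s\n' "/user/${user}/${dfs_path}" ;;
  esac
}

# A relative input path submitted by root:
resolve_dfs_path root examples/bin/work/20news-bydate/bayes-train-input
# prints /user/root/examples/bin/work/20news-bydate/bayes-train-input
```

If that is what happened here, the path in the exception is a cluster path, and the data has to exist there under that name for the job to find it.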

Last updated: 02 Nov '12, 05:04
