Monday, May 25, 2015

How to run a MapReduce jar using an Oozie workflow

In this article we will go through the steps to use Oozie to execute MapReduce programs / jar files provided with the Hadoop distribution. You may use this as a base and set up your own applications / workflows accordingly.
Environment
  • PHD 1.1.1 / Oozie 3.3.2
Prerequisite
  • Basic Oozie knowledge and a working Oozie setup on a PHD cluster
Example used
  • Wordcount program from hadoop-mapreduce-examples.jar
  • hadoop-mapreduce-examples.jar is a symbolic link to the jar provided in the Pivotal Hadoop Distribution.
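To confirm which jar the symbolic link resolves to on your cluster, you can check it directly (the path below is the one used in this environment; yours may differ):
[hadoop@hdm1 ~]$ ls -l /usr/lib/gphd/hadoop-mapreduce/hadoop-mapreduce-examples.jar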
Below is the list of steps
1) Extract the jar hadoop-mapreduce-examples.jar. You can find this jar under the /usr/lib/gphd/hadoop-mapreduce directory on a Pivotal Hadoop cluster.
[hadoop@hdm1 test]$ jar xf hadoop-mapreduce-examples.jar
[hadoop@hdm1 test]$ ls
hadoop-mapreduce-examples.jar META-INF org
2) Navigate to the directory to see the list of class files associated with WordCount. 
[hadoop@hdm1 test]$ cd org/apache/hadoop/examples/
[hadoop@hdm1 examples]$ ls WordCount*
WordCount.class WordCount$IntSumReducer.class WordCount$TokenizerMapper.class
In the WordCount program, the mapper class is WordCount$TokenizerMapper.class and the reducer class is WordCount$IntSumReducer.class. We will use these classes when defining the Oozie workflow.xml.
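As an alternative to extracting the jar, you can simply list its contents to confirm the class names; the command below should show the same three class files seen above:
[hadoop@hdm1 test]$ jar tf hadoop-mapreduce-examples.jar | grep WordCount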
3) Create a job.properties file. The parameters for the Oozie job are provided in a Java properties file (.properties) or a Hadoop configuration XML file (.xml); in this case we use a .properties file.
nameNode=hdfs://phdha
jobTracker=hdm1.phd.local:8032
queueName=default
examplesRoot=examplesoozie
oozie.wf.application.path=${nameNode}/user/${user.name}/${examplesRoot}/map-reduce
outputDir=map-reduce

where:
nameNode = Variable that defines the NameNode URI through which HDFS is accessed. Format: hdfs://<nameservice> or hdfs://<namenode_host>:<port>
jobTracker = Variable that defines the ResourceManager address in a YARN implementation. Format: <resourcemanager_hostname>:<port>
queueName = Name of the queue as defined by the Capacity Scheduler, Fair Scheduler, etc. By default, it is "default".
examplesRoot = Root directory on HDFS (relative to the user's home directory) under which the workflow files are kept.
oozie.wf.application.path = Path on HDFS that holds the workflow.xml to be executed.
outputDir = Variable that defines the output directory.

Note: You can define the parameter oozie.libpath, under which all the libraries required for the MapReduce program can be stored. However, in this example we do not use it.
Ex: oozie.libpath=${nameNode}/user/${user.name}/share/lib
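Keep in mind that job.properties is read by the Oozie client at submission time and stays on the local filesystem; only workflow.xml and the lib directory go to HDFS. The run command in step 8 assumes the file was saved under the examplesRoot directory in the user's home, for example:
[hadoop@hdm1 ~]$ mkdir -p examplesoozie/map-reduce
[hadoop@hdm1 ~]$ vi examplesoozie/map-reduce/job.properties     # paste the properties shown above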
4) Create a workflow.xml. workflow.xml defines a set of actions to be performed as a sequence or in a control-dependency DAG (Directed Acyclic Graph). A "control dependency" from one action to another means that the second action can't run until the first action has completed.
Refer to the documentation at http://oozie.apache.org/docs/3.3.2/WorkflowFunctionalSpec.html for details
<workflow-app xmlns="uri:oozie:workflow:0.1" name="map-reduce-wf">
 <start to="mr-node"/>
 <action name="mr-node">
     <map-reduce>
       <job-tracker>${jobTracker}</job-tracker>
       <name-node>${nameNode}</name-node>
       <prepare>
         <delete path="${nameNode}/user/${wf:user()}/${examplesRoot}/output-data/${outputDir}"/>
       </prepare>
 
   <configuration>
     <property>
       <name>mapred.mapper.new-api</name>
       <value>true</value>
     </property>
     <property>
       <name>mapred.reducer.new-api</name>
       <value>true</value>
     </property>
     <property>
       <name>mapred.job.queue.name</name>
       <value>${queueName}</value>
     </property>
     <property>
       <name>mapreduce.map.class</name>
       <value>org.apache.hadoop.examples.WordCount$TokenizerMapper</value>
     </property>
     <property>
       <name>mapreduce.reduce.class</name>
       <value>org.apache.hadoop.examples.WordCount$IntSumReducer</value>
     </property>
     <property>
       <name>mapreduce.combine.class</name>
       <value>org.apache.hadoop.examples.WordCount$IntSumReducer</value>
     </property>
     <property>
       <name>mapred.output.key.class</name>
       <value>org.apache.hadoop.io.Text</value>
     </property>
     <property>
       <name>mapred.output.value.class</name>
       <value>org.apache.hadoop.io.IntWritable</value>
     </property>
     <property>
       <name>mapred.input.dir</name>
       <value>/user/${wf:user()}/${examplesRoot}/input-data/text</value>
     </property>
     <property>
       <name>mapred.output.dir</name>
       <value>/user/${wf:user()}/${examplesRoot}/output-data/${outputDir}</value>
     </property>
   </configuration>
  </map-reduce>
  <ok to="end"/>
  <error to="fail"/>
 </action>
   <kill name="fail">
   <message>Map/Reduce failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
   </kill>
   <end name="end"/>
</workflow-app>
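Before uploading, you can optionally validate the file against the Oozie workflow schema with the client (a quick sanity check; it only verifies the XML structure, not the class names or paths):
[hadoop@hdm1 map-reduce]$ oozie validate workflow.xml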
5. Create a directory on HDFS under which all the files related to the Oozie job will be kept. Into this directory, push the workflow.xml created in the previous step.
[hadoop@hdm1 map-reduce]$ hdfs dfs -mkdir -p /user/hadoop/examplesoozie/map-reduce
[hadoop@hdm1 map-reduce]$ hdfs dfs -copyFromLocal workflow.xml /user/hadoop/examplesoozie/map-reduce/workflow.xml
6. Now, under the directory created for the Oozie job, create a folder named lib in which the required library / jar files will be kept.
[hadoop@hdm1 map-reduce]$ hdfs dfs -mkdir -p /user/hadoop/examplesoozie/map-reduce/lib
7. Once the directory is created, copy the hadoop-mapreduce-examples jar into it.
[hadoop@hdm1 map-reduce]$ hdfs dfs -copyFromLocal /usr/lib/gphd/hadoop-mapreduce/hadoop-mapreduce-examples.jar /user/hadoop/examplesoozie/map-reduce/lib/hadoop-mapreduce-examples.jar
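The workflow also expects input under /user/hadoop/examplesoozie/input-data/text. This article does not show that step explicitly; judging from the sample output in step 10, a local file such as /etc/passwd was used, so staging the input would look roughly like this:
[hadoop@hdm1 ~]$ hdfs dfs -mkdir -p /user/hadoop/examplesoozie/input-data/text
[hadoop@hdm1 ~]$ hdfs dfs -copyFromLocal /etc/passwd /user/hadoop/examplesoozie/input-data/text/passwd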
8. Now you can execute the workflow you created and use it to run the Hadoop MapReduce wordcount program.
[hadoop@hdm1 ~]$ oozie job -oozie http://localhost:11000/oozie -config examplesoozie/map-reduce/job.properties -run
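If you want Oozie to check the submission without actually launching it, the client also accepts a dry run with the same configuration file:
[hadoop@hdm1 ~]$ oozie job -oozie http://localhost:11000/oozie -config examplesoozie/map-reduce/job.properties -dryrun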
9. You can view the status of the job as shown below:
[hadoop@hdm1 ~]$ oozie job -oozie http://localhost:11000/oozie -info 0000009-140529162032574-oozie-oozi-W
Job ID : 0000009-140529162032574-oozie-oozi-W
------------------------------------------------------------------------------------------------------------------------------------
Workflow Name : map-reduce-wf
App Path      : hdfs://phdha/user/hadoop/examplesoozie/map-reduce
Status        : SUCCEEDED
Run           : 0
User          : hadoop
Group         : -
Created       : 2014-05-30 00:31 GMT
Started       : 2014-05-30 00:31 GMT
Last Modified : 2014-05-30 00:32 GMT
Ended         : 2014-05-30 00:32 GMT
CoordAction ID: -

Actions
------------------------------------------------------------------------------------------------------------------------------------
ID                                                                            Status    Ext ID                 Ext Status Err Code
------------------------------------------------------------------------------------------------------------------------------------
0000009-140529162032574-oozie-oozi-W@:start:                                  OK        -                      OK         -
------------------------------------------------------------------------------------------------------------------------------------
0000009-140529162032574-oozie-oozi-W@mr-node                                  OK        job_1401405229971_0022 SUCCEEDED  -
------------------------------------------------------------------------------------------------------------------------------------
0000009-140529162032574-oozie-oozi-W@end                                      OK        -                      OK         -
------------------------------------------------------------------------------------------------------------------------------------
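For more detail than this summary, the same client can print the workflow log for the job (useful when an action fails):
[hadoop@hdm1 ~]$ oozie job -oozie http://localhost:11000/oozie -log 0000009-140529162032574-oozie-oozi-W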
10. Once the job has completed, you can review the output in the directory specified by workflow.xml.
[hadoop@hdm1 ~]$ hdfs dfs -cat /user/hadoop/examplesoozie/output-data/map-reduce/part-r-00000
SSH:/var/empty/sshd:/sbin/nologin 1
Server:/var/lib/pgsql:/bin/bash 1
User:/var/ftp:/sbin/nologin 1
Yarn:/home/yarn:/sbin/nologin 1
adm:x:3:4:adm:/var/adm:/sbin/nologin 1
bin:x:1:1:bin:/bin:/sbin/nologin 1
console 1
daemon:x:2:2:daemon:/sbin:/sbin/nologin 1
ftp:x:14:50:FTP 1
games:x:12:100:games:/usr/games:/sbin/nologin 1
gopher:x:13:30:gopher:/var/gopher:/sbin/nologin 1
gpadmin:x:500:500::/home/gpadmin:/bin/bash 1
hadoop:x:503:501:Hadoop:/home/hadoop:/bin/bash 1
halt:x:7:0:halt:/sbin:/sbin/halt 1
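Listing the output directory is another quick check; with default settings it should also contain a _SUCCESS marker alongside the part files:
[hadoop@hdm1 ~]$ hdfs dfs -ls /user/hadoop/examplesoozie/output-data/map-reduce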
Miscellaneous
1. You may see an exception like the one below if workflow.xml is not specified correctly; for instance, if mapreduce.map.class is misspelled as mapreduce.mapper.class, or mapreduce.reduce.class as mapreduce.reducer.class.
2014-05-29 16:29:26,870 ERROR [main] org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:hadoop (auth:SIMPLE) cause:java.io.IOException: Type mismatch in key from map: expected org.apache.hadoop.io.Text, received org.apache.hadoop.io.LongWritable
2014-05-29 16:29:26,874 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : java.io.IOException: Type mismatch in key from map: expected org.apache.hadoop.io.Text, received org.apache.hadoop.io.LongWritable
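The external ID shown by "oozie job -info" (job_1401405229971_0022 above) corresponds to a YARN application ID, so the full task logs can usually be retrieved with the YARN CLI, assuming log aggregation is enabled:
[hadoop@hdm1 ~]$ yarn logs -applicationId application_1401405229971_0022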
2. Refer to the documentation at https://github.com/yahoo/oozie/wiki/Oozie-WF-use-cases to view several examples of workflow.xml.
3. WordCount.java program for reference
package org.apache.hadoop.examples;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCount {

  public static class TokenizerMapper 
       extends Mapper<Object, Text, Text, IntWritable>{
    
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
      
    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }
  
  public static class IntSumReducer 
       extends Reducer<Text,IntWritable,Text,IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, 
                       Context context
                       ) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length != 2) {
      System.err.println("Usage: wordcount  ");
      System.exit(2);
    }
    Job job = new Job(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
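If you modify this source and want to run your own build through the same workflow, a minimal way to compile and package it (assuming the Hadoop client tools are on the PATH; the jar name is arbitrary) would be:
[hadoop@hdm1 ~]$ mkdir wc-classes
[hadoop@hdm1 ~]$ javac -classpath "$(hadoop classpath)" -d wc-classes WordCount.java
[hadoop@hdm1 ~]$ jar cf my-wordcount.jar -C wc-classes .
[hadoop@hdm1 ~]$ hdfs dfs -copyFromLocal my-wordcount.jar /user/hadoop/examplesoozie/map-reduce/lib/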
