In this article, we will go through the steps to use Oozie to execute the MapReduce programs / jar files shipped with the Hadoop distribution. You may use this as a base and set up your own applications / workflows accordingly.
Environment
- PHD 1.1.1 / Oozie 3.3.2
Prerequisite
- Basic Oozie knowledge and a working Oozie installation on a PHD cluster
Example used
- Wordcount program from hadoop-mapreduce-examples.jar
- hadoop-mapreduce-examples.jar is a symbolic link to the versioned jar provided in the Pivotal Hadoop Distribution (you can confirm this as shown below).
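You can confirm the symlink and its versioned target with ls -l; the exact version in the target name will depend on your PHD release:

[hadoop@hdm1 ~]$ ls -l /usr/lib/gphd/hadoop-mapreduce/hadoop-mapreduce-examples.jar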
Below is the list of steps:
1. Extract hadoop-mapreduce-examples.jar. You can find this jar under the /usr/lib/gphd/hadoop-mapreduce directory on a Pivotal Hadoop cluster.
[hadoop@hdm1 test]$ jar xf hadoop-mapreduce-examples.jar
[hadoop@hdm1 test]$ ls
hadoop-mapreduce-examples.jar  META-INF  org
2. Navigate to the directory to see the list of class files associated with WordCount.
[hadoop@hdm1 test]$ cd org/apache/hadoop/examples/
[hadoop@hdm1 examples]$ ls WordCount*
WordCount.class  WordCount$IntSumReducer.class  WordCount$TokenizerMapper.class
In the WordCount program, the name of the mapper class is WordCount$TokenizerMapper.class and the reducer class is WordCount$IntSumReducer.class. We will use these class names when defining the Oozie workflow.xml.
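Alternatively, you can list the class names without extracting the jar at all; a quick check using only the standard jar and grep tools:

[hadoop@hdm1 test]$ jar tf hadoop-mapreduce-examples.jar | grep WordCount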
3. Create a job.properties file. The parameters for the Oozie job are provided in a Java properties file (.properties) or a Hadoop configuration XML file (.xml); in this case we use a .properties file.
nameNode=hdfs://phdha
jobTracker=hdm1.phd.local:8032
queueName=default
examplesRoot=examplesoozie
oozie.wf.application.path=${nameNode}/user/${user.name}/${examplesRoot}/map-reduce
outputDir=map-reduce

where:

nameNode = Variable to define the namenode path by which HDFS can be accessed. Format: hdfs://<nameservice> or hdfs://<namenode_host>:<port>
jobTracker = Variable to define the ResourceManager address in the case of a YARN implementation. Format: <resourcemanager_hostname>:<port>
queueName = Name of the queue as defined by the Capacity Scheduler, Fair Scheduler, etc. By default, it is "default".
examplesRoot = Environment variable for the workflow.
oozie.wf.application.path = Environment variable which defines the path on HDFS that holds the workflow.xml to be executed.
outputDir = Variable to define the output directory.

Note: You can define the parameter oozie.libpath, under which all the libraries required for the MapReduce program can be stored. However, in this example we do not use it. Ex: oozie.libpath=${nameNode}/user/${user.name}/share/lib
4. Create a workflow.xml. The workflow.xml file defines a set of actions to be performed as a sequence or in a control dependency DAG (Directed Acyclic Graph). A "control dependency" from one action to another means that the second action can't run until the first action has completed.
Refer to the documentation at http://oozie.apache.org/docs/3.3.2/WorkflowFunctionalSpec.html for details
<workflow-app xmlns="uri:oozie:workflow:0.1" name="map-reduce-wf">
    <start to="mr-node"/>
    <action name="mr-node">
        <map-reduce>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <prepare>
                <delete path="${nameNode}/user/${wf:user()}/${examplesRoot}/output-data/${outputDir}"/>
            </prepare>
            <configuration>
                <property>
                    <name>mapred.mapper.new-api</name>
                    <value>true</value>
                </property>
                <property>
                    <name>mapred.reducer.new-api</name>
                    <value>true</value>
                </property>
                <property>
                    <name>mapred.job.queue.name</name>
                    <value>${queueName}</value>
                </property>
                <property>
                    <name>mapreduce.map.class</name>
                    <value>org.apache.hadoop.examples.WordCount$TokenizerMapper</value>
                </property>
                <property>
                    <name>mapreduce.reduce.class</name>
                    <value>org.apache.hadoop.examples.WordCount$IntSumReducer</value>
                </property>
                <property>
                    <name>mapreduce.combine.class</name>
                    <value>org.apache.hadoop.examples.WordCount$IntSumReducer</value>
                </property>
                <property>
                    <name>mapred.output.key.class</name>
                    <value>org.apache.hadoop.io.Text</value>
                </property>
                <property>
                    <name>mapred.output.value.class</name>
                    <value>org.apache.hadoop.io.IntWritable</value>
                </property>
                <property>
                    <name>mapred.input.dir</name>
                    <value>/user/${wf:user()}/${examplesRoot}/input-data/text</value>
                </property>
                <property>
                    <name>mapred.output.dir</name>
                    <value>/user/${wf:user()}/${examplesRoot}/output-data/${outputDir}</value>
                </property>
            </configuration>
        </map-reduce>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Map/Reduce failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>
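Before uploading it, you can sanity-check the file against the workflow schema with the Oozie client (this validates the XML structure only, not the property values):

[hadoop@hdm1 map-reduce]$ oozie validate workflow.xml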
5. Create a directory on HDFS under which all the files related to the Oozie job will be kept. Push the workflow.xml created in the previous step into this directory.
[hadoop@hdm1 map-reduce]$ hdfs dfs -mkdir -p /user/hadoop/examplesoozie/map-reduce
[hadoop@hdm1 map-reduce]$ hdfs dfs -copyFromLocal workflow.xml /user/hadoop/examplesoozie/map-reduce/workflow.xml
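You can verify the upload with a quick listing:

[hadoop@hdm1 map-reduce]$ hdfs dfs -ls /user/hadoop/examplesoozie/map-reduce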
6. Now, under the directory created for the Oozie job, create a folder named lib, in which the required library / jar files will be kept.
[hadoop@hdm1 map-reduce]$ hdfs dfs -mkdir -p /user/hadoop/examplesoozie/map-reduce/lib
7. Once the directory is created, copy hadoop-mapreduce-examples.jar into it.
[hadoop@hdm1 map-reduce]$ hdfs dfs -copyFromLocal /usr/lib/gphd/hadoop-mapreduce/hadoop-mapreduce-examples.jar /user/hadoop/examplesoozie/map-reduce/lib/hadoop-mapreduce-examples.jar
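Before running the job, make sure input data exists: workflow.xml reads from /user/${wf:user()}/${examplesRoot}/input-data/text, so that directory must contain at least one text file. For example (judging by the sample output later in this article, /etc/passwd was used as input; any text file will do):

[hadoop@hdm1 ~]$ hdfs dfs -mkdir -p /user/hadoop/examplesoozie/input-data/text
[hadoop@hdm1 ~]$ hdfs dfs -copyFromLocal /etc/passwd /user/hadoop/examplesoozie/input-data/text/passwd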
8. Now you can execute the workflow and use it to run the Hadoop MapReduce wordcount program. Note that job.properties is read from the local file system, while workflow.xml and the lib directory live on HDFS.
[hadoop@hdm1 ~]$ oozie job -oozie http://localhost:11000/oozie -config examplesoozie/map-reduce/job.properties -run
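If the submission succeeds, the client prints the ID of the new workflow job (your ID will differ):

job: 0000009-140529162032574-oozie-oozi-W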
9. You can view the status of the job as shown below:
[hadoop@hdm1 ~]$ oozie job -oozie http://localhost:11000/oozie -info 0000009-140529162032574-oozie-oozi-W
Job ID : 0000009-140529162032574-oozie-oozi-W
------------------------------------------------------------------------------------------------------------------------------------
Workflow Name : map-reduce-wf
App Path : hdfs://phdha/user/hadoop/examplesoozie/map-reduce
Status : SUCCEEDED
Run : 0
User : hadoop
Group : -
Created : 2014-05-30 00:31 GMT
Started : 2014-05-30 00:31 GMT
Last Modified : 2014-05-30 00:32 GMT
Ended : 2014-05-30 00:32 GMT
CoordAction ID: -
Actions
------------------------------------------------------------------------------------------------------------------------------------
ID Status Ext ID Ext Status Err Code
------------------------------------------------------------------------------------------------------------------------------------
0000009-140529162032574-oozie-oozi-W@:start: OK - OK -
------------------------------------------------------------------------------------------------------------------------------------
0000009-140529162032574-oozie-oozi-W@mr-node OK job_1401405229971_0022 SUCCEEDED -
------------------------------------------------------------------------------------------------------------------------------------
0000009-140529162032574-oozie-oozi-W@end OK - OK -
------------------------------------------------------------------------------------------------------------------------------------
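If an action fails, the same client can fetch the workflow log for troubleshooting:

[hadoop@hdm1 ~]$ oozie job -oozie http://localhost:11000/oozie -log 0000009-140529162032574-oozie-oozi-W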
10. Once the job completes, you can review the output in the directory specified by workflow.xml.
[hadoop@hdm1 ~]$ hdfs dfs -cat /user/hadoop/examplesoozie/output-data/map-reduce/part-r-00000
SSH:/var/empty/sshd:/sbin/nologin 1
Server:/var/lib/pgsql:/bin/bash 1
User:/var/ftp:/sbin/nologin 1
Yarn:/home/yarn:/sbin/nologin 1
adm:x:3:4:adm:/var/adm:/sbin/nologin 1
bin:x:1:1:bin:/bin:/sbin/nologin 1
console 1
daemon:x:2:2:daemon:/sbin:/sbin/nologin 1
ftp:x:14:50:FTP 1
games:x:12:100:games:/usr/games:/sbin/nologin 1
gopher:x:13:30:gopher:/var/gopher:/sbin/nologin 1
gpadmin:x:500:500::/home/gpadmin:/bin/bash 1
hadoop:x:503:501:Hadoop:/home/hadoop:/bin/bash 1
halt:x:7:0:halt:/sbin:/sbin/halt 1
Miscellaneous
1. You may see an exception like the one below if workflow.xml is not specified correctly. For instance, if mapreduce.map.class is misspelled as mapreduce.mapper.class and mapreduce.reduce.class as mapreduce.reducer.class, the misspelled properties are silently ignored and Hadoop falls back to the default identity mapper, which emits the LongWritable file offset as the key instead of Text, producing the type mismatch:
2014-05-29 16:29:26,870 ERROR [main] org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:hadoop (auth:SIMPLE) cause:java.io.IOException: Type mismatch in key from map: expected org.apache.hadoop.io.Text, received org.apache.hadoop.io.LongWritable
2014-05-29 16:29:26,874 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : java.io.IOException: Type mismatch in key from map: expected org.apache.hadoop.io.Text, received org.apache.hadoop.io.LongWritable
2. Refer to the documentation at https://github.com/yahoo/oozie/wiki/Oozie-WF-use-cases to view several examples of workflow.xml.
3. WordCount.java program for reference:
package org.apache.hadoop.examples;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCount {

  // Mapper: tokenizes each input line and emits a (word, 1) pair per token.
  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer (also used as the combiner): sums the counts for each word.
  public static class IntSumReducer
       extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values,
                       Context context
                       ) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length != 2) {
      System.err.println("Usage: wordcount <in> <out>");
      System.exit(2);
    }
    Job job = new Job(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
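As a quick sanity check, the same example can also be run directly with the hadoop client, outside of Oozie. The output path here is illustrative and must not already exist:

[hadoop@hdm1 ~]$ hadoop jar /usr/lib/gphd/hadoop-mapreduce/hadoop-mapreduce-examples.jar wordcount /user/hadoop/examplesoozie/input-data/text /tmp/wordcount-out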