What is Oozie?
- Oozie is a server-based workflow engine specialized in running workflow jobs.
- A workflow is a collection of nodes arranged in a control dependency DAG (Directed Acyclic Graph).
- Workflow nodes are classified in 2 types:
- Control flow nodes: Nodes that control the start and end of the workflow and workflow job execution path.
- Action nodes: Nodes that trigger the execution of a computation/processing task.
- Oozie workflow definitions are written in hPDL (an XML Process Definition Language).
- Oozie supports actions to execute jobs for the following Hadoop components:
- Apache MapReduce
- Apache Pig
- Apache Hive
- Apache Sqoop
- System-specific jobs (Java programs, shell scripts, etc.)
- Oozie is installed on top of Hadoop, so the Hadoop components must be installed and running.
Oozie - What are its components?
- Tomcat server - A web server that includes the Tomcat JSP engine and a variety of connectors; its core component is called Catalina. Catalina provides Tomcat's actual implementation of the servlet specification, so when you start up your Tomcat server, you are actually starting Catalina.
- Database: Oozie stores workflow jobs details in the database.
- Default: Derby
- Others supported: HSQL, MySQL, Oracle and PostgreSQL
- ExtJS - Optional. A JavaScript application framework for building interactive web applications. It is used to enable the Oozie Web Console UI, from which Oozie workflow job details can be viewed.
- Client CLI - A utility providing command-line functions to interact with Oozie. It is used to submit, monitor and control Oozie jobs.
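As a quick sketch of the CLI (the server URL and job IDs below are placeholders, and these commands require a running Oozie server):

```shell
# Submit and immediately start a workflow job
oozie job -oozie http://localhost:11000/oozie -config job.properties -run

# Or submit first (the job stays in PREP state), then start it explicitly
oozie job -oozie http://localhost:11000/oozie -config job.properties -submit
oozie job -oozie http://localhost:11000/oozie -start 0000000-140709224809456-oozie-oozi-W

# Check status, or kill a running job
oozie job -oozie http://localhost:11000/oozie -info 0000000-140709224809456-oozie-oozi-W
oozie job -oozie http://localhost:11000/oozie -kill 0000000-140709224809456-oozie-oozi-W
```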
Oozie - Derby database & Web Console UI
- Oozie uses the Apache Derby RDBMS to store workflow execution details.
- By default, Derby database logs are stored under /var/log/gphd/oozie/derby.log
- “ij” is an interactive SQL scripting tool that can be used to connect to Derby.
- How to connect to the Derby database (/var/lib/gphd/oozie/oozie-db is the database path):

[gpadmin@pccadmin]$ sudo -u oozie /usr/java/jdk1.7.0_45/db/bin/ij
ij version 10.8
ij> connect 'jdbc:derby:/var/lib/gphd/oozie/oozie-db';
- The Oozie Web Console UI pulls these details from the configured (Derby) database; logging in to the database is not required.
Putting it all together!
- User installs Oozie and configures it with an appropriate database.
- User creates a workflow XML and the required configuration files, which specify the job properties and the type of action to be used.
- User submits the job to the Oozie server using the Oozie CLI.
- User can monitor the status of job execution using the Oozie Web Console UI or Oozie CLI options.
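The steps above can be sketched as a short session (the application name "myapp" and the paths are hypothetical, and the commands assume a running Hadoop cluster and Oozie server):

```shell
# Copy the application workflow directory (workflow.xml plus any scripts) to HDFS
hdfs dfs -put myapp /user/gpadmin/examples/apps/myapp

# Submit and start the job; job.properties stays on the local filesystem
oozie job -oozie http://localhost:11000/oozie -config myapp/job.properties -run

# Monitor the job using the job ID printed by the previous command
oozie job -oozie http://localhost:11000/oozie -info <job-id>
```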
A workflow consists of 2 types of nodes:
- Action nodes to execute the below tasks. Each action supports invoking a job for the respective application; for example, the hive action is used to invoke a Hive SQL script.
- Map Reduce
- Pig
- Hive
- FS (HDFS)
- Sub-Workflow
- Java program
- ssh
- Control nodes to provide the below functions:
- Start : Entry point for a workflow job, it indicates the first workflow node the workflow job must transition to.
- End : The end for a workflow job, it indicates that the workflow job has completed successfully.
- Kill : The kill node allows a workflow job to kill itself. When a workflow job reaches the kill node, it finishes in error (KILLED).
- Decision : A decision node enables a workflow to make a selection on the execution path to follow. The behavior of a decision node can be seen as a switch-case statement.
- Fork and join : A fork node splits one path of execution into multiple concurrent paths of execution. A join node waits until every concurrent execution path of a previous fork node arrives at it.
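As an illustrative sketch (node and path names here are hypothetical, not from the example below), a fork/join pair and a decision node look like this in hPDL:

```
<fork name="parallel-steps">
    <path start="pig-node"/>
    <path start="hive-node"/>
</fork>
...
<join name="joining" to="decide"/>

<decision name="decide">
    <switch>
        <case to="hive-node">${fs:exists(concat(nameNode, '/user/gpadmin/input'))}</case>
        <default to="end"/>
    </switch>
</decision>
```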
The diagram below attempts to showcase some of the action and control nodes.
The workflow application files are placed as follows:
1. On the local filesystem:
- job.properties
2. On HDFS:
- Application workflow directory
- All configuration files and scripts (Pig, Hive, shell) needed by the workflow action nodes should be under the application HDFS directory.
- workflow.xml
- Controls the execution of the application
- A directory named “lib” (Optional)
- Any additional JARs required for program execution are provided by the user in this directory. During execution, all the JARs available under lib are added to the classpath.
Here is an example with a sub-workflow with a hive action.
Ex 1 : Create a workflow to execute the below 2 actions:
- Execute a FS action (Main Workflow)
- Execute a sub workflow with hive action (Sub Workflow)
In order to perform the above task, we will need to create the below files:
- job.properties (On local filesystem)
- workflow.xml for Main workflow (On HDFS)
- workflow.xml for Sub Workflow (On HDFS)
- script.q (A text file with Hive SQL)
Once the workflow.xml files are created, they should be put under the required directories on HDFS.
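A sketch of staging these files (the HDFS paths mirror the example below; the local filenames are assumptions):

```shell
# Main workflow
hdfs dfs -mkdir -p /user/gpadmin/examples/apps/hive
hdfs dfs -put workflow.xml /user/gpadmin/examples/apps/hive/

# Sub-workflow and its Hive script
hdfs dfs -mkdir -p /user/gpadmin/examples/apps/hivesub
hdfs dfs -put sub-workflow.xml /user/gpadmin/examples/apps/hivesub/workflow.xml
hdfs dfs -put script.q /user/gpadmin/examples/apps/hivesub/
```

Note that job.properties stays on the local filesystem; it is passed to the Oozie CLI at submission time.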
job.properties - Let's have it under examples/apps/hive/
nameNode=hdfs://test
jobTracker=pccadmin.phd.local:8032
queueName=default
examplesRoot=examples
oozie.use.system.libpath=true
#queueName=batch
oozie.wf.application.path=${nameNode}/user/${user.name}/${examplesRoot}/apps/hive
workflow.xml (Parent workflow)
hdfs dfs -ls /user/gpadmin/examples/apps/hive/workflow.xml
<workflow-app xmlns="uri:oozie:workflow:0.2.5" name="hive-wf">
    <start to="pwf"/>
    <action name='pwf'>
        <fs>
            <delete path="${nameNode}/user/gpadmin/examples/output-data/hivesub"/>
        </fs>
        <ok to="swf"/>
        <error to="fail"/>
    </action>
    <action name='swf'>
        <sub-workflow>
            <app-path>${nameNode}/user/gpadmin/${examplesRoot}/apps/hivesub</app-path>
            <propagate-configuration/>
        </sub-workflow>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Hive failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>
workflow.xml (sub workflow)
hdfs dfs -ls /user/gpadmin/examples/apps/hivesub/workflow.xml
<workflow-app xmlns="uri:oozie:workflow:0.2.5" name="start-swf">
    <credentials>
        <credential name='hive_auth' type='hcat'>
            <property>
                <name>hcat.metastore.uri</name>
                <value>thrift://hdw2.phd.local:9083</value>
            </property>
            <property>
                <name>hcat.metastore.principal</name>
                <value>hive/_HOST@MYREALM</value>
            </property>
        </credential>
    </credentials>
    <start to="hive-node"/>
    <action name="hive-node" cred='hive_auth'>
        <hive xmlns="uri:oozie:hive-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <prepare>
                <delete path="${nameNode}/user/gpadmin/examples/output-data/hivesub"/>
                <mkdir path="${nameNode}/user/gpadmin/examples/output-data"/>
            </prepare>
            <job-xml>${nameNode}/user/oozie/hive-oozie-site.xml</job-xml>
            <configuration>
                <property>
                    <name>mapred.job.queue.name</name>
                    <value>${queueName}</value>
                </property>
            </configuration>
            <script>script.q</script>
            <param>INPUT=${nameNode}/user/gpadmin/examples/input-data/table</param>
            <param>OUTPUT=${nameNode}/user/gpadmin/examples/output-data/hive</param>
        </hive>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Hive failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>
[gpadmin@pccadmin ~]$ hdfs dfs -cat examples/apps/hivesub/script.q
DROP Table if exists testa;
CREATE EXTERNAL TABLE testa (a INT) STORED AS TEXTFILE LOCATION '${INPUT}';
INSERT OVERWRITE DIRECTORY '${OUTPUT}' SELECT * FROM testa;
[gpadmin@pccadmin ~]$ oozie job -oozie http://localhost:11000/oozie -config examples/apps/hive/job.properties -run
job: 0000000-140709224809456-oozie-oozi-W
[gpadmin@pccadmin ~]$ oozie job -oozie http://localhost:11000/oozie -info 0000000-140709224809456-oozie-oozi-W
Job ID : 0000000-140709224809456-oozie-oozi-W
------------------------------------------------------------------------------------------------------------------------------------
Workflow Name : hive-wf
App Path      : hdfs://test/user/gpadmin/examples/apps/hive
Status        : RUNNING
Run           : 0
User          : gpadmin
Group         : -
Created       : 2014-07-10 06:58 GMT
Started       : 2014-07-10 06:58 GMT
Last Modified : 2014-07-10 06:58 GMT
Ended         : -
CoordAction ID: -

Actions
------------------------------------------------------------------------------------------------------------------------------------
ID                                            Status    Ext ID                                 Ext Status  Err Code
------------------------------------------------------------------------------------------------------------------------------------
0000000-140709224809456-oozie-oozi-W@:start:  OK        -                                      OK          -
------------------------------------------------------------------------------------------------------------------------------------
0000000-140709224809456-oozie-oozi-W@pwf      OK        -                                      OK          -
------------------------------------------------------------------------------------------------------------------------------------
0000000-140709224809456-oozie-oozi-W@swf      RUNNING   0000001-140709224809456-oozie-oozi-W   -           -
------------------------------------------------------------------------------------------------------------------------------------
[gpadmin@pccadmin ~]$ oozie job -oozie http://localhost:11000/oozie -info 0000000-140709224809456-oozie-oozi-W
Job ID : 0000000-140709224809456-oozie-oozi-W
------------------------------------------------------------------------------------------------------------------------------------
Workflow Name : hive-wf
App Path      : hdfs://test/user/gpadmin/examples/apps/hive
Status        : SUCCEEDED
Run           : 0
User          : gpadmin
Group         : -
Created       : 2014-07-10 06:58 GMT
Started       : 2014-07-10 06:58 GMT
Last Modified : 2014-07-10 06:59 GMT
Ended         : 2014-07-10 06:59 GMT
CoordAction ID: -

Actions
------------------------------------------------------------------------------------------------------------------------------------
ID                                            Status    Ext ID                                 Ext Status  Err Code
------------------------------------------------------------------------------------------------------------------------------------
0000000-140709224809456-oozie-oozi-W@:start:  OK        -                                      OK          -
------------------------------------------------------------------------------------------------------------------------------------
0000000-140709224809456-oozie-oozi-W@pwf      OK        -                                      OK          -
------------------------------------------------------------------------------------------------------------------------------------
0000000-140709224809456-oozie-oozi-W@swf      OK        0000001-140709224809456-oozie-oozi-W   SUCCEEDED   -
------------------------------------------------------------------------------------------------------------------------------------
0000000-140709224809456-oozie-oozi-W@end      OK        -                                      OK          -
------------------------------------------------------------------------------------------------------------------------------------
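If the job had failed instead, the CLI can also pull the job log for troubleshooting (using the same job ID as above):

```shell
oozie job -oozie http://localhost:11000/oozie -log 0000000-140709224809456-oozie-oozi-W
```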