Monday, May 25, 2015

Oozie - A Workflow Scheduler

What is Oozie?
  • Oozie is a server-based workflow engine specialized in running workflow jobs.
  • A workflow is a collection of nodes arranged in a control-dependency DAG (Directed Acyclic Graph).
  • Workflow nodes are classified into two types:
    • Control flow nodes: Nodes that control the start and end of the workflow and workflow job execution path.
    • Action nodes: Nodes that trigger the execution of a computation/processing task.
  • Oozie workflow definitions are written in hPDL (an XML Process Definition Language); a minimal skeleton is sketched after this list.
  • Oozie supports actions to execute jobs for the following Hadoop components:
    • Apache MapReduce
    • Apache Pig
    • Apache Hive
    • Apache Sqoop
    • System-specific jobs (Java programs, shell scripts, etc.)
  • Oozie is installed on top of Hadoop, so the Hadoop components must be installed and running.
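For instance, here is a minimal hPDL skeleton; the workflow name, node names, and path are illustrative only, not from a real deployment:
<workflow-app xmlns="uri:oozie:workflow:0.2.5" name="minimal-wf">
    <start to="make-dir"/>
    <!-- A trivial FS action: create a directory on HDFS -->
    <action name="make-dir">
        <fs>
            <mkdir path="${nameNode}/tmp/minimal-wf"/>
        </fs>
        <ok to="done"/>
        <error to="failed"/>
    </action>
    <kill name="failed">
        <message>mkdir failed</message>
    </kill>
    <end name="done"/>
</workflow-app>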
Oozie - What are its components?
  • Tomcat server - A web server that includes a Tomcat JSP engine and a variety of connectors, but whose core component is called Catalina. Catalina provides Tomcat's actual implementation of the servlet specification, so when you start up your Tomcat server, you are actually starting Catalina.
  • Database: Oozie stores workflow jobs details in the database.
    • Default: Derby
    • Others supported: HSQL, MySQL, Oracle and PostgreSQL
  • ExtJS - Optional. A JavaScript application framework for building interactive web applications; it is used to enable the Oozie web console UI, through which Oozie workflow job details can be viewed.
  • Client CLI - A utility providing command-line functions to interact with Oozie; it is used to submit, monitor, and control Oozie jobs.
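For example, assuming the server runs at the default URL (http://localhost:11000/oozie), some common CLI interactions are:
[gpadmin@pccadmin ~]$ oozie admin -oozie http://localhost:11000/oozie -status    # print the server system mode, e.g. NORMAL
[gpadmin@pccadmin ~]$ oozie jobs -oozie http://localhost:11000/oozie             # list workflow jobs known to the server
[gpadmin@pccadmin ~]$ oozie job -oozie http://localhost:11000/oozie -config job.properties -run    # submit and start a job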
Oozie - Derby database & Web Console UI
  • Oozie uses the Apache Derby RDBMS to store workflow execution details.
  • By default, the Derby database log is written to /var/log/gphd/oozie/derby.log
  • “ij” is an interactive SQL scripting tool that can be used to connect to the Derby database.
  • How to connect to the Derby database (/var/lib/gphd/oozie/oozie-db is the database path):
[gpadmin@pccadmin]$ sudo -u oozie /usr/java/jdk1.7.0_45/db/bin/ij
ij version 10.8
ij> connect 'jdbc:derby:/var/lib/gphd/oozie/oozie-db';
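Once connected, the backing tables can be queried directly. A minimal sketch, assuming Oozie's standard schema (the WF_JOBS table and its id/status columns may vary across Oozie versions):
ij> SELECT id, status FROM WF_JOBS;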
  • The Oozie Web Console UI pulls these details from the configured (Derby) database, so logging in to the database is not required.
Putting it all together !!
  • User installs Oozie and configures it with an appropriate database.
  • User creates a workflow XML and the required configuration files, which specify the job properties and the type of action to be used.
  • User submits the job to the Oozie server using the Oozie CLI.
  • User can monitor the status of job execution using the Oozie Web Console UI or the Oozie CLI options.
Let’s understand a workflow !!
A workflow consists of two types of nodes:
  1. Action nodes, which execute the tasks below. Each of these actions invokes a job for the respective application; for example, the hive action is used to invoke a Hive SQL script.
    • Map Reduce
    • Pig
    • Hive
    • FS (HDFS)
    • Sub-Workflow
    • Java program
    • ssh
  2. Control nodes, which provide the following functions:
    • Start : Entry point for a workflow job, it indicates the first workflow node the workflow job must transition to.
    • End : The end for a workflow job, it indicates that the workflow job has completed successfully.
    • Kill : The kill node allows a workflow job to kill itself. When a workflow job reaches the kill node, it finishes in error (KILLED).
    • Decision : A decision node enables a workflow to make a selection on the execution path to follow. The behavior of a decision node can be seen as a switch-case statement.
    • Fork and join : A fork node splits one path of execution into multiple concurrent paths of execution. A join node waits until every concurrent execution path of its corresponding fork node arrives at it.
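As a sketch of how these control nodes fit together (the workflow name, node names, EL predicate, and paths are hypothetical), a decision node combined with a fork/join pair could look like this:
<workflow-app xmlns="uri:oozie:workflow:0.2.5" name="control-nodes-demo">
    <start to="check-input"/>
    <!-- Decision: behaves like a switch-case on an EL predicate -->
    <decision name="check-input">
        <switch>
            <case to="parallel-clean">${fs:exists(inputDir)}</case>
            <default to="fail"/>
        </switch>
    </decision>
    <!-- Fork: split into two concurrent paths of execution -->
    <fork name="parallel-clean">
        <path start="clean-a"/>
        <path start="clean-b"/>
    </fork>
    <action name="clean-a">
        <fs><delete path="${nameNode}/tmp/a"/></fs>
        <ok to="joined"/>
        <error to="fail"/>
    </action>
    <action name="clean-b">
        <fs><delete path="${nameNode}/tmp/b"/></fs>
        <ok to="joined"/>
        <error to="fail"/>
    </action>
    <!-- Join: wait for both forked paths before moving on -->
    <join name="joined" to="end"/>
    <kill name="fail">
        <message>Input missing or cleanup failed</message>
    </kill>
    <end name="end"/>
</workflow-app>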
[Diagram: a sample workflow illustrating some of the action and control nodes]
What we need to build an application workflow !!
1. On local filesystem :
  • job.properties
2. On HDFS :
  • Application workflow directory
    • All configuration files and scripts (Pig, Hive, shell) needed by the workflow action nodes should be under the application's HDFS directory.
    • workflow.xml  
      • Controls the execution of the application
    • “lib” named directory (Optional)
      • Any additional JARs required for program execution are placed by the user in this directory; during execution, all JARs under lib are added to the classpath. A staging sketch follows this list.
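For example, one might stage such an application directory on HDFS as follows (the application name “myapp” and the JAR name are hypothetical):
hdfs dfs -mkdir -p /user/gpadmin/examples/apps/myapp/lib
hdfs dfs -put workflow.xml /user/gpadmin/examples/apps/myapp/
hdfs dfs -put script.q /user/gpadmin/examples/apps/myapp/
hdfs dfs -put my-udf.jar /user/gpadmin/examples/apps/myapp/lib/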

Here is an example of a workflow with a sub-workflow that runs a hive action.
Ex 1: Create a workflow to execute the following two actions:
  • Execute a FS action (Main Workflow)
  • Execute a sub workflow with hive action (Sub Workflow)
In order to perform the above task, we will need to create the following files:
  1. job.properties (On local filesystem)
  2. workflow.xml for Main workflow (On HDFS)
  3. workflow.xml for Sub Workflow (On HDFS)
  4. script.q (A text file with Hive SQL, placed in the sub-workflow application directory on HDFS)
Once the workflow.xml files are created, they should be placed in the required directories on HDFS.
job.properties - Let's have it under examples/apps/hive/
nameNode=hdfs://test
jobTracker=pccadmin.phd.local:8032
queueName=default
examplesRoot=examples
oozie.use.system.libpath=true
#queueName=batch
oozie.wf.application.path=${nameNode}/user/${user.name}/${examplesRoot}/apps/hive
workflow.xml (Parent workflow)
hdfs dfs -ls /user/gpadmin/examples/apps/hive/workflow.xml
<workflow-app xmlns="uri:oozie:workflow:0.2.5" name="hive-wf">
    <start to="pwf"/>
    <!-- FS action: delete the sub-workflow's output directory before it runs -->
    <action name="pwf">
        <fs>
            <delete path="${nameNode}/user/gpadmin/examples/output-data/hivesub"/>
        </fs>
        <ok to="swf"/>
        <error to="fail"/>
    </action>
    <!-- Sub-workflow action: run the hivesub application; propagate-configuration
         passes this job's configuration (the job.properties values) down to it -->
    <action name="swf">
        <sub-workflow>
            <app-path>${nameNode}/user/gpadmin/${examplesRoot}/apps/hivesub</app-path>
            <propagate-configuration/>
        </sub-workflow>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Hive failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>
workflow.xml (Sub-workflow)
hdfs dfs -ls /user/gpadmin/examples/apps/hivesub/workflow.xml
<workflow-app xmlns="uri:oozie:workflow:0.2.5" name="start-swf">
    <!-- Credentials for a Kerberos-secured Hive metastore (HCatalog) -->
    <credentials>
        <credential name="hive_auth" type="hcat">
            <property>
                <name>hcat.metastore.uri</name>
                <value>thrift://hdw2.phd.local:9083</value>
            </property>
            <property>
                <name>hcat.metastore.principal</name>
                <value>hive/_HOST@MYREALM</value>
            </property>
        </credential>
    </credentials>
    <start to="hive-node"/>
    <action name="hive-node" cred="hive_auth">
        <hive xmlns="uri:oozie:hive-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <!-- Clean up the output location before the Hive job runs -->
            <prepare>
                <delete path="${nameNode}/user/gpadmin/examples/output-data/hivesub"/>
                <mkdir path="${nameNode}/user/gpadmin/examples/output-data"/>
            </prepare>
            <job-xml>${nameNode}/user/oozie/hive-oozie-site.xml</job-xml>
            <configuration>
                <property>
                    <name>mapred.job.queue.name</name>
                    <value>${queueName}</value>
                </property>
            </configuration>
            <!-- script.q lives in the application directory; INPUT/OUTPUT are passed to it as parameters -->
            <script>script.q</script>
            <param>INPUT=${nameNode}/user/gpadmin/examples/input-data/table</param>
            <param>OUTPUT=${nameNode}/user/gpadmin/examples/output-data/hive</param>
        </hive>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Hive failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>
[gpadmin@pccadmin ~]$ hdfs dfs -cat examples/apps/hivesub/script.q
DROP TABLE IF EXISTS testa;
CREATE EXTERNAL TABLE testa (a INT) STORED AS TEXTFILE LOCATION '${INPUT}';
INSERT OVERWRITE DIRECTORY '${OUTPUT}' SELECT * FROM testa;
[gpadmin@pccadmin ~]$ oozie job -oozie http://localhost:11000/oozie -config examples/apps/hive/job.properties -run
job: 0000000-140709224809456-oozie-oozi-W
[gpadmin@pccadmin ~]$ oozie job -oozie http://localhost:11000/oozie -info 0000000-140709224809456-oozie-oozi-W
Job ID : 0000000-140709224809456-oozie-oozi-W
------------------------------------------------------------------------------------------------------------------------------------
Workflow Name : hive-wf
App Path : hdfs://test/user/gpadmin/examples/apps/hive
Status : RUNNING
Run : 0
User : gpadmin
Group : -
Created : 2014-07-10 06:58 GMT
Started : 2014-07-10 06:58 GMT
Last Modified : 2014-07-10 06:58 GMT
Ended : -
CoordAction ID: -
Actions
------------------------------------------------------------------------------------------------------------------------------------
ID                                            Status    Ext ID                                 Ext Status  Err Code
------------------------------------------------------------------------------------------------------------------------------------
0000000-140709224809456-oozie-oozi-W@:start:  OK        -                                      OK          -
------------------------------------------------------------------------------------------------------------------------------------
0000000-140709224809456-oozie-oozi-W@pwf      OK        -                                      OK          -
------------------------------------------------------------------------------------------------------------------------------------
0000000-140709224809456-oozie-oozi-W@swf      RUNNING   0000001-140709224809456-oozie-oozi-W   -           -
------------------------------------------------------------------------------------------------------------------------------------
[gpadmin@pccadmin ~]$ oozie job -oozie http://localhost:11000/oozie -info 0000000-140709224809456-oozie-oozi-W
Job ID : 0000000-140709224809456-oozie-oozi-W
------------------------------------------------------------------------------------------------------------------------------------
Workflow Name : hive-wf
App Path : hdfs://test/user/gpadmin/examples/apps/hive
Status : SUCCEEDED
Run : 0
User : gpadmin
Group : -
Created : 2014-07-10 06:58 GMT
Started : 2014-07-10 06:58 GMT
Last Modified : 2014-07-10 06:59 GMT
Ended : 2014-07-10 06:59 GMT
CoordAction ID: -
Actions
------------------------------------------------------------------------------------------------------------------------------------
ID                                            Status    Ext ID                                 Ext Status  Err Code
------------------------------------------------------------------------------------------------------------------------------------
0000000-140709224809456-oozie-oozi-W@:start:  OK        -                                      OK          -
------------------------------------------------------------------------------------------------------------------------------------
0000000-140709224809456-oozie-oozi-W@pwf      OK        -                                      OK          -
------------------------------------------------------------------------------------------------------------------------------------
0000000-140709224809456-oozie-oozi-W@swf      OK        0000001-140709224809456-oozie-oozi-W   SUCCEEDED   -
------------------------------------------------------------------------------------------------------------------------------------
0000000-140709224809456-oozie-oozi-W@end      OK        -                                      OK          -
------------------------------------------------------------------------------------------------------------------------------------
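If an action fails, the Oozie CLI can also fetch the job log for troubleshooting (same server URL as above):
[gpadmin@pccadmin ~]$ oozie job -oozie http://localhost:11000/oozie -log 0000000-140709224809456-oozie-oozi-W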
