Oozie is a server-based workflow engine specialized in running workflow jobs whose actions run Hadoop Map/Reduce and Pig jobs.
Oozie has three components:
a server-based workflow engine, a server-based coordinator engine and a server-based bundle engine. Oozie is a Java web application that runs in a Java servlet container.
Oozie can store and run different types of Hadoop jobs (MapReduce, Hive, Pig and so on), can run workflow jobs based on time and data triggers, and can manage batches of coordinator applications.
Oozie has been designed to scale, and it can manage the timely execution of thousands of workflows in a Hadoop cluster, each composed of possibly dozens of constituent jobs. An Oozie workflow is a collection of actions (e.g. Hadoop Map/Reduce jobs, Pig jobs) arranged in a control dependency DAG (Directed Acyclic Graph), which specifies the order in which the actions execute. This graph is specified in hPDL (an XML Process Definition Language).
hPDL is a fairly compact language with a limited number of flow control and action nodes. Control nodes define the flow of execution: they mark the beginning and end of a workflow (start, end and kill nodes) and provide mechanisms to control the workflow execution path (decision, fork and join nodes). Action nodes are the mechanism by which a workflow triggers the execution of a computation/processing task. Oozie provides support for the following types of actions: Hadoop map-reduce, Hadoop file system, Pig, Java and Oozie sub-workflow (the SSH action is removed as of Oozie schema 0.2).
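As a rough illustration of how these nodes fit together, the sketch below writes a minimal hPDL file with a single map-reduce action between the start and end nodes. The application name demo-wf, the node names and the variables ${jobTracker}, ${nameNode}, ${inputDir} and ${outputDir} are assumptions chosen for this example, and the mapper and reducer settings are left out for brevity. The failure node is written with the kill element.

```shell
# Write a minimal workflow definition. The quoted 'EOF' stops the shell from
# expanding ${...}, so the variables reach Oozie unchanged.
cat > workflow.xml <<'EOF'
<workflow-app xmlns="uri:oozie:workflow:0.2" name="demo-wf">
    <!-- Control node: where execution begins -->
    <start to="mr-node"/>

    <!-- Action node: one map-reduce job (mapper/reducer settings omitted) -->
    <action name="mr-node">
        <map-reduce>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                    <name>mapred.input.dir</name>
                    <value>${inputDir}</value>
                </property>
                <property>
                    <name>mapred.output.dir</name>
                    <value>${outputDir}</value>
                </property>
            </configuration>
        </map-reduce>
        <ok to="end"/>
        <error to="fail"/>
    </action>

    <!-- Control node: ends the workflow as failed -->
    <kill name="fail">
        <message>Map/Reduce failed: [${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>

    <!-- Control node: normal completion -->
    <end name="end"/>
</workflow-app>
EOF
```

Each action names the node to go to on success (ok) and on failure (error), which is how the DAG edges are expressed.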
All computation/processing tasks triggered by an action node are remote to Oozie: they are executed by the Hadoop Map/Reduce framework. This approach allows Oozie to leverage existing Hadoop machinery for load balancing, failover, etc. The majority of these tasks are executed asynchronously (the exception is the file system action, which is handled synchronously). This means that for most types of computation/processing tasks triggered by a workflow action, the workflow job has to wait until the task completes before transitioning to the following node in the workflow. Oozie can detect completion of computation/processing tasks by two different means: callbacks and polling. When a computation/processing task is started by Oozie, Oozie provides a unique callback URL to the task, and the task should invoke that URL to notify Oozie of its completion. For cases where the task fails to invoke the callback URL for any reason (e.g. a transient network failure), or when the type of task cannot invoke the callback URL upon completion, Oozie has a mechanism to poll computation/processing tasks for completion.
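The callback-then-poll idea can be sketched in a few lines of shell. This is a toy model only and not how Oozie is implemented: a marker file stands in for the callback URL, and the hypothetical task_status function stands in for asking Hadoop about the job.

```shell
# Marker file that plays the role of the unique callback URL (hypothetical).
callback_marker="${TMPDIR:-/tmp}/callback-demo.$$"

# Stand-in for polling the cluster; a real check would query the job state.
task_status() {
    echo "SUCCEEDED"
}

# Wait for the callback for $1 checks, $2 seconds apart; if it never
# arrives, fall back to polling the task status once.
wait_for_task() {
    checks=$1
    delay=$2
    while [ "$checks" -gt 0 ]; do
        if [ -e "$callback_marker" ]; then
            echo "completed via callback"
            return 0
        fi
        checks=$((checks - 1))
        sleep "$delay"
    done
    if [ "$(task_status)" = "SUCCEEDED" ]; then
        echo "completed via polling"
        return 0
    fi
    return 1
}

# Example: wait_for_task 5 1
```

The point of the fallback is that a lost callback delays the transition to the next node but does not leave the workflow stuck.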
Oozie workflows can be parameterized (using variables like ${inputDir} within the workflow definition). When submitting a workflow job, values for the parameters must be provided. If properly parameterized (e.g. using different output directories), several identical workflow jobs can run concurrently.
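A minimal sketch of supplying those values is a properties file passed at submission time. The host names, ports and HDFS paths below are placeholders assumed for this example, as are the parameter names other than ${inputDir}.

```shell
# Write the parameter values for one run. The quoted 'EOF' keeps
# ${nameNode} literal so that Oozie, not the shell, resolves it.
cat > job.properties <<'EOF'
# Hypothetical values: adjust hosts, ports and paths to your cluster.
nameNode=hdfs://localhost:9000
jobTracker=localhost:9001
inputDir=/user/oozie/demo/input
outputDir=/user/oozie/demo/output-run1
oozie.wf.application.path=${nameNode}/user/oozie/demo/app
EOF

# Submit and start the job (commented out because it needs a running server):
# bin/oozie job -oozie http://localhost:11000/oozie -config job.properties -run
```

A second copy of the file that differs only in outputDir (say output-run2) would let the same workflow run alongside the first without the two jobs overwriting each other.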
Some workflows are invoked on demand, but most of the time it is necessary to run them at regular time intervals and/or based on data availability and/or external events. The Oozie Coordinator system allows the user to define workflow execution schedules based on these parameters. The Oozie coordinator models workflow execution triggers as predicates, which can reference data, time and/or external events. The workflow job is started after the predicate is satisfied.
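For illustration, a coordinator definition combining a time trigger (once a day) with a data trigger (that day's input directory must exist) might look like the sketch below. The names daily-coord, logs and input, the dates, the directory layout and the variables ${nameNode} and ${workflowAppPath} are all assumptions made for this example.

```shell
# Write a coordinator definition; the quoted 'EOF' keeps every ${...}
# expression literal for Oozie to evaluate.
cat > coordinator.xml <<'EOF'
<coordinator-app name="daily-coord" frequency="${coord:days(1)}"
                 start="2012-01-01T00:00Z" end="2012-12-31T00:00Z"
                 timezone="UTC" xmlns="uri:oozie:coordinator:0.1">
    <!-- Data the workflow depends on: one directory per day -->
    <datasets>
        <dataset name="logs" frequency="${coord:days(1)}"
                 initial-instance="2012-01-01T00:00Z" timezone="UTC">
            <uri-template>${nameNode}/data/logs/${YEAR}/${MONTH}/${DAY}</uri-template>
        </dataset>
    </datasets>
    <!-- Predicate: the current day's dataset instance must be available -->
    <input-events>
        <data-in name="input" dataset="logs">
            <instance>${coord:current(0)}</instance>
        </data-in>
    </input-events>
    <!-- What to start once the predicate holds -->
    <action>
        <workflow>
            <app-path>${workflowAppPath}</app-path>
            <configuration>
                <property>
                    <name>inputDir</name>
                    <value>${coord:dataIn('input')}</value>
                </property>
            </configuration>
        </workflow>
    </action>
</coordinator-app>
EOF
```

Here the resolved dataset path is handed to the workflow as its inputDir parameter, so the workflow definition itself does not need to know about the schedule.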
It is also often necessary to connect workflow jobs that run regularly, but at different time intervals. The outputs of multiple subsequent runs of one workflow become the input to the next workflow. A chain of workflows connected in this way is referred to as a data application pipeline. The Oozie coordinator supports the creation of such data application pipelines.
Installing Oozie
Step-1: Prerequisites
You can follow the instructions on the official Oozie website to pick matching versions of the Hadoop stack software. In this tutorial we use Oozie version 3.0.2, which is available on GitHub. Its system requirements are as follows:
-Unix (tested on Linux and Mac OS X). We used Ubuntu Lucid, Server version.
-Java 1.6+
-Hadoop
  -Apache Hadoop (tested with 0.20.2)
  -Yahoo! Hadoop (tested with 0.20.104.2)
-ExtJS library (optional, to enable the Oozie web console)
  -ExtJS 2.2
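Before installing, it can help to confirm that the basic tools are present. The following is a small sketch, not part of the Oozie distribution; the helper name have is made up for this example, and it only checks for java and tar.

```shell
# Return 0 if the named command is on the PATH, 1 otherwise.
have() {
    command -v "$1" >/dev/null 2>&1
}

# Report each tool the installation steps rely on.
for tool in java tar; do
    if have "$tool"; then
        echo "found:   $tool"
    else
        echo "missing: $tool"
    fi
done

# Print the Java version line so it can be compared with the 1.6+ requirement.
if have java; then
    java -version 2>&1 | head -n 1
fi
```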
Step-2: Server Installation
-Download or build an Oozie binary distribution https://github.com/yahoo/oozie/downloads
-Download a Hadoop binary distribution http://www.us.apache.org/dist/hadoop/common/hadoop-0.20.2/
-Download ExtJS library (it must be version 2.2) http://extjs.com/deploy/ext-2.2.zip
-Expand the two tar.gz packages (the Oozie and Hadoop distributions) as the oozie Unix user, as the official documentation recommends for a server installation. The commands are shown below:
->oozie@dm4:~$ tar zxvf oozie-3.0.2-distro.tar.gz -C {oozie home}
->oozie@dm4:~$ tar zxvf hadoop-0.20.2.tar.gz -C {hadoop home}
-Build oozie.war. Oozie runs on Hadoop, but its distribution bundle ships without the Hadoop JAR files and without the ExtJS library (because they are under different licenses). We have to run the Oozie setup script to pack in the required Hadoop JAR files and the optional ExtJS library, which enables the Oozie web console. The Oozie server scripts run only under the Unix user that owns the Oozie installation directory; if necessary, use sudo -u OOZIE_USER when invoking the scripts. The command is shown below:
->$ bin/oozie-setup.sh -hadoop 0.20.2 ${HADOOP_HOME} -extjs /tmp/ext-2.2.zip
-Start Oozie and, if needed, edit the Oozie configuration. To start Oozie as a daemon process, run:
->$ bin/oozie-start.sh
-Using the Oozie command line tool check the status of Oozie:
->$ bin/oozie admin -oozie http://localhost:11000/oozie -status
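Because the server can take a moment to come up after the start script returns, the status check can be wrapped in a retry loop. This is a sketch under assumptions: the function name wait_for_oozie is made up, and it assumes the status output contains the word NORMAL once the server is ready.

```shell
# Command and URL are the ones used in the status check; both can be overridden.
OOZIE_CMD=${OOZIE_CMD:-bin/oozie}
OOZIE_URL=${OOZIE_URL:-http://localhost:11000/oozie}

# Check the status up to $1 times, sleeping $2 seconds after each failed
# attempt. Returns 0 as soon as the output mentions NORMAL, 1 otherwise.
wait_for_oozie() {
    tries=$1
    delay=$2
    while [ "$tries" -gt 0 ]; do
        if "$OOZIE_CMD" admin -oozie "$OOZIE_URL" -status 2>/dev/null | grep -q NORMAL; then
            return 0
        fi
        tries=$((tries - 1))
        sleep "$delay"
    done
    return 1
}

# Example: wait_for_oozie 10 3 || echo "Oozie did not reach NORMAL"
```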
Using a browser, go to the Oozie web console; the Oozie status should be NORMAL. If you get HTTP 404 Not Found instead, you can fix it by editing the configuration. Open conf/oozie-default.xml with vim and copy the property "oozie.services" into oozie-site.xml. In the "oozie.services" property in oozie-site.xml, one of the service names is "KerberosHadoopAccessorService". Remove only the "Kerberos" prefix, leaving "HadoopAccessorService", and then restart Oozie.
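Once the "oozie.services" property has been copied into oozie-site.xml, the rename itself can be scripted. The helper below is a sketch with a made-up name, fix_accessor_service; it keeps a .bak copy of the file and only touches the service name described above.

```shell
# Rewrite the service name in the given file, keeping the original as FILE.bak.
# Writing through a backup copy avoids sed -i, whose syntax differs between
# Linux and Mac OS X.
fix_accessor_service() {
    site_file=$1
    cp "$site_file" "$site_file.bak"
    sed 's/KerberosHadoopAccessorService/HadoopAccessorService/g' \
        "$site_file.bak" > "$site_file"
}

# Example: fix_accessor_service conf/oozie-site.xml
```

After running it, restart Oozie so the changed service list takes effect.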