Friday, 11 September 2015

Apache Falcon – Defining First Process

In the last two posts (post 1, post 2) I gave a basic introduction to Apache Falcon. In this post I will describe how to write a basic Falcon data pipeline.
The Falcon process described here is triggered when two conditions are met –
  1. The process start time (15:00 UTC) is reached.
  2. A trigger folder named ${YEAR}-${MONTH}-${DAY} is created under /tmp/feed-01/.
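
As a rough illustration of the second condition, the sketch below builds the trigger folder path for a given date and shows the command that would create it. The helper name trigger_path is hypothetical, and I am assuming the folder lives on HDFS; the mkdir is left commented out so the snippet runs without a Hadoop client.

```shell
#!/usr/bin/env bash
# Build the trigger folder path that the feed looks for.
# trigger_path is a hypothetical helper name, not part of Falcon.
trigger_path() {
    local day="$1"          # date in YYYY-MM-DD form
    echo "/tmp/feed-01/${day}"
}

# Today's date in UTC, formatted to match ${YEAR}-${MONTH}-${DAY}.
today="$(date -u +%Y-%m-%d)"
echo "Trigger folder for today: $(trigger_path "${today}")"

# On a cluster node, creating the folder would look like this (assumes HDFS):
# hadoop fs -mkdir -p "$(trigger_path "${today}")"
```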

Once triggered, the Falcon process invokes an Oozie workflow. The workflow runs a script over SSH, and that script writes its two input parameters to the file /tmp/demo.out on the local file system of the SSH host.

The code for Falcon cluster (test-primary-cluster) is –

Note that you need to create the staging and working directories on HDFS with the proper permissions and ownership. The following permissions and ownership are needed on a Hortonworks cluster –

hadoop fs -mkdir -p /apps/falcon/test-primary-cluster/staging/
hadoop fs -chmod 777 /apps/falcon/test-primary-cluster/staging/
hadoop fs -mkdir -p /apps/falcon/test-primary-cluster/working/
hadoop fs -chmod 755 /apps/falcon/test-primary-cluster/working/
hadoop fs -chown -R falcon:hadoop /apps/falcon/test-primary-cluster

The code for Falcon feed (feed-01-trigger) is –

For this feed -
  • The retention limit is set to 9999 months.
  • The late-arrival limit is set to 20 hours.
  • The frequency is set to daily.
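
A feed entity consistent with these three settings might look roughly like the sketch below, written to a file from the shell. The validity dates, description, ACL values and schema element are placeholders I have assumed; only the name, frequency, late-arrival cut-off, retention limit and path pattern come from this post.

```shell
#!/usr/bin/env bash
# Write a sketch of the feed entity to feed-01-trigger.xml.
# The quoted 'EOF' keeps ${YEAR}, ${MONTH} and ${DAY} literal,
# so Falcon (not the shell) expands them.
cat > feed-01-trigger.xml <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch only: validity dates, ACL values and description are placeholders. -->
<feed name="feed-01-trigger" description="Daily trigger feed" xmlns="uri:falcon:feed:0.1">
    <frequency>days(1)</frequency>
    <timezone>UTC</timezone>
    <late-arrival cut-off="hours(20)"/>
    <clusters>
        <cluster name="test-primary-cluster" type="source">
            <validity start="2015-09-01T00:00Z" end="2099-12-31T00:00Z"/>
            <retention limit="months(9999)" action="delete"/>
        </cluster>
    </clusters>
    <locations>
        <location type="data" path="/tmp/feed-01/${YEAR}-${MONTH}-${DAY}"/>
    </locations>
    <ACL owner="falcon" group="hadoop" permission="0755"/>
    <schema location="/none" provider="none"/>
</feed>
EOF
```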

The code for Falcon process (process-01) is –

For this process -
  • The start time is set at 15:00 UTC.
  • The dependency is set to the input feed feed-01-trigger.
  • The retry policy allows 2 attempts with a gap of 15 minutes.
  • The process also uses an EL expression to set the input2 variable to yesterday's date.
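
To make the last bullet concrete: for a process instance dated 2015-09-11, input2 would carry the previous day, 2015-09-10. The sketch below reproduces only that date arithmetic in shell. It is not Falcon's EL syntax, the helper name yesterday_of is hypothetical, the YYYY-MM-DD output format is my assumption, and it relies on GNU date (the -d option).

```shell
#!/usr/bin/env bash
# Compute the day before a given date, in YYYY-MM-DD form (UTC).
# Assumes GNU date; yesterday_of is a hypothetical helper name.
yesterday_of() {
    local day="$1"
    date -u -d "${day} 1 day ago" +%Y-%m-%d
}

# Value that input2 would hold for an instance dated 2015-09-11.
yesterday_of "2015-09-11"
```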

The Oozie workflow with the SSH action is defined below –

This Oozie workflow -
  • Gets the input1, input2 and workflowName variables from the Falcon process process-01.
  • Invokes the shell script on the poc001 box with input1 and input2 as parameters.
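
A workflow with this shape could be sketched as follows, again written to a file from the shell. The SSH user (demo_user), the script path, and the node names are placeholders I have assumed; the host poc001 and the three variables are the ones named in the bullets.

```shell
#!/usr/bin/env bash
# Write a sketch of the Oozie workflow to workflow.xml.
# The quoted 'EOF' keeps ${...} literal so Oozie resolves them at run time.
cat > workflow.xml <<'EOF'
<workflow-app name="${workflowName}" xmlns="uri:oozie:workflow:0.4">
    <start to="ssh-demo"/>
    <action name="ssh-demo">
        <ssh xmlns="uri:oozie:ssh-action:0.1">
            <!-- demo_user and the script path are placeholders -->
            <host>demo_user@poc001</host>
            <command>/home/demo_user/demo.bash</command>
            <args>${input1}</args>
            <args>${input2}</args>
        </ssh>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>SSH action failed: ${wf:errorMessage(wf:lastErrorNode())}</message>
    </kill>
    <end name="end"/>
</workflow-app>
EOF
```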

The demo.bash script called by the Oozie SSH action is given below –

demo.bash is a simple script that echoes the current date and the input1 and input2 variables to the file /tmp/demo.out.
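
A minimal sketch of a script matching this description might look like the following. The function name log_inputs and the OUT_FILE override are hypothetical additions that make the script easy to try out; by default it appends to /tmp/demo.out as described above.

```shell
#!/usr/bin/env bash
# demo.bash (sketch): append the current date and the two inputs to a file.
# OUT_FILE is a hypothetical override; the default is /tmp/demo.out.
OUT_FILE="${OUT_FILE:-/tmp/demo.out}"

# log_inputs is a hypothetical helper name.
log_inputs() {
    local input1="$1" input2="$2"
    echo "$(date) input1=${input1} input2=${input2}" >> "${OUT_FILE}"
}

# Write a line only when both parameters are supplied,
# as they would be by the Oozie SSH action.
if [ "$#" -ge 2 ]; then
    log_inputs "$1" "$2"
fi
```

Running it by hand as `bash demo.bash foo 2015-09-10` should append one line to /tmp/demo.out.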

In my next post I will explain how to submit and schedule this Falcon process.
