Data Lab: Apache Falcon - Defining a process dependent on Multiple Hive Tables

For some requirements we might need to use two or more Hive tables as input, to a single Apache Falcon process. In this post I am going to explain how we can define a Falcon process which takes two input tables and stores the output to one Hive table.

Let us assume that there are two input tables, each partitioned by feed_date column. For each feed_date we want to get record count for both the input tables and store table name along-with the record count in an output table which is also partitioned by feed_date.

The Falcon feed for first Hive table is defined as -

And the second feed Hive table is defined as -

Our output Hive table is defined as -

Note: DDL for output Hive table is -

USE tmp_rishav
;
CREATE TABLE rec_count_tbl (tbl_name STRING, cnt INT) partitioned by(feed_date date) stored as textfile
;

The Falcon process which uses two input tables and writes output to one table is -

The above Falcon process invokes a Hive script which is given below -

You might have notice that we are using some variables like falcon_output_database, falcon_input1_database, etc. These variables are defined by Falcon. For a complete list of variables set by Falcon you can check Falcon UI Log for your process and then check "Action Configuration" tab for this Oozie workflow.

Falcon sets up these variables for input Hive feed -
<param>falcon_input1_storage_type=TABLE</param>
<param>falcon_input1_catalog_url=thrift://localhost:9083</param>
<param>falcon_input1_table=table1</param>
<param>falcon_input1_database=tmp_rishav</param>
<param>falcon_input1_filter=(feed_date='2015-09-28')</param>
this input1 part of variable's name is dependent on the name which we give to feed in Falcon process.
If you have defined multiple Hive input feeds with names like input1, input2, input3, etc. you can refer them by replacing input1 in above variable's name.

And for output Hive feed below variables are defined -
<param>falcon_output_table=rec_count_tbl</param>
<param>falcon_output_partitions_java='feed_date=2015-09-28'</param>
<param>falcon_output_catalog_url=thrift://localhost:9083</param>
<param>falcon_output_partitions_pig='feed_date=2015-09-28'</param>
<param>falcon_output_partitions_hive=feed_date='2015-09-28'</param>
<param>falcon_output_dated_partition_value_feed_date=2015-09-28</param>
<param>falcon_output_storage_type=TABLE</param>
<param>falcon_output_database=tmp_rishav</param>
<param>falcon_output_partitions=feed_date='2015-09-28'</param>
again output part of variable's name is dependent on the name which we give to output feed in Falcon process.

Before submitting/scheduling these Falcon entities we need to upload the Hive script to HDFS.
Now we can submit and schedule these Falcon entities.

The commands to submit and schedule these Falcon entities are -

falcon entity -type feed -submit -file input-table-1-feed.xml
falcon entity -type feed -submit -file input-table-2-feed.xml
falcon entity -type feed -submit -file rec-count-tbl-feed.xml
falcon entity -type process -submit -file multi-table-process.xml

falcon entity -type feed -name input-table-1 -schedule
falcon entity -type feed -name input-table-2 -schedule
falcon entity -type feed -name rec-count-tbl -schedule
falcon entity -type process -name multi-table-process -schedule

Data Lab

Wednesday, 30 September 2015

Apache Falcon - Defining a process dependent on Multiple Hive Tables

1 comment: