Tuesday, 8 September 2015

Apache Falcon – Basic Concepts

Apache Falcon uses only three types of entities to describe all data management policies and pipelines. These entities are:

  • Cluster: Represents the “interfaces” to a Hadoop cluster
  • Feed: Defines a “dataset”, such as a file, Hive table or stream
  • Process: Consumes feeds, invokes processing logic and produces feeds
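
To make the idea of cluster “interfaces” concrete, here is a rough sketch of what a cluster entity might look like. The element and attribute names follow the Falcon cluster schema as I understand it, so please check them against the entity specification for your Falcon version. The cluster name, colo, host names, ports, versions and paths are all made-up placeholders.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical cluster entity: every name, host, port and version below is a placeholder -->
<cluster name="primaryCluster" colo="datacenter-1" description="Example cluster"
         xmlns="uri:falcon:cluster:0.1">
  <interfaces>
    <!-- Each interface is one of the endpoints Falcon uses to talk to the Hadoop cluster -->
    <interface type="readonly"  endpoint="hftp://namenode.example.com:50070" version="2.6.0"/>
    <interface type="write"     endpoint="hdfs://namenode.example.com:8020" version="2.6.0"/>
    <interface type="execute"   endpoint="resourcemanager.example.com:8050" version="2.6.0"/>
    <interface type="workflow"  endpoint="http://oozie.example.com:11000/oozie/" version="4.1.0"/>
    <interface type="messaging" endpoint="tcp://falcon.example.com:61616?daemon=true" version="5.1.6"/>
  </interfaces>
  <locations>
    <!-- Directories Falcon uses on this cluster for its own artifacts -->
    <location name="staging" path="/apps/falcon/primaryCluster/staging"/>
    <location name="temp"    path="/tmp"/>
    <location name="working" path="/apps/falcon/primaryCluster/working"/>
  </locations>
</cluster>
```

Feeds and processes then refer to this cluster by its name, which is how the entities get linked together.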

With just these three entity types we can manage replication, archival and retention of data, and also handle job/process failures and late data arrival.

These Falcon entities:

  • Are easy and simple to define using XML.
  • Are modular: clusters, feeds and processes are defined separately and then linked together, so they are easy to re-use across multiple pipelines.
  • Can be configured for replication, late data arrival, archival and retention.
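
As an illustration of the points above, here is a minimal sketch of a feed entity that is replicated from one cluster to another, with a late-arrival cut-off and a retention policy on each cluster. It assumes two cluster entities named primaryCluster and backupCluster have already been defined. Those names, the feed name, the dates, the limits, the path and the ACL values are hypothetical, and the element names should be checked against the feed schema of your Falcon version.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical feed entity: names, dates, limits and paths are placeholders -->
<feed name="rawEventsFeed" description="Hourly raw events" xmlns="uri:falcon:feed:0.1">
  <!-- A new instance of this dataset is expected every hour -->
  <frequency>hours(1)</frequency>
  <timezone>UTC</timezone>
  <!-- Late data arrival: data may still arrive up to 4 hours late -->
  <late-arrival cut-off="hours(4)"/>
  <clusters>
    <!-- Clusters are referenced by name; a source plus a target cluster means replication -->
    <cluster name="primaryCluster" type="source">
      <validity start="2015-09-01T00:00Z" end="2016-09-01T00:00Z"/>
      <!-- Retention: instances older than 90 days are deleted on the source -->
      <retention limit="days(90)" action="delete"/>
    </cluster>
    <cluster name="backupCluster" type="target">
      <validity start="2015-09-01T00:00Z" end="2016-09-01T00:00Z"/>
      <!-- The replicated copy is kept for longer on the target -->
      <retention limit="months(36)" action="delete"/>
    </cluster>
  </clusters>
  <locations>
    <!-- One directory per feed instance, resolved from the instance time -->
    <location type="data" path="/data/raw/events/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
  </locations>
  <ACL owner="etl" group="users" permissions="0755"/>
  <schema location="/none" provider="none"/>
</feed>
```

Because the feed only names its clusters, the same cluster definitions can be shared by any number of feeds and processes.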

Using Falcon, a complicated data pipeline can be simplified to a few Falcon entities, which the Falcon engine itself then converts into multiple Oozie workflows.

In my next post I will explain how we can define a Falcon process and the prerequisites for that.

Reference - http://www.slideshare.net/Hadoop_Summit/driving-enterprise-data-governance-for-big-data-systems-through-apache-falcon
