Data Lab: Apache Falcon – Basic Concepts

Tuesday, 8 September 2015

Apache Falcon – Basic Concepts

Apache Falcon uses only three types of entities to describe all data management policies and pipelines. These entities are:

· Cluster: Represents the “interfaces” to a Hadoop cluster

· Feed: Defines a “dataset”, which can be a file, a Hive table, or a stream

· Process: Consumes feeds, invokes processing logic, and produces feeds

Using only these three entity types, we can manage replication, archival, and retention of data, and also handle job or process failures and late data arrival.
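To make the relationship between the three entity types concrete, here is a minimal Python sketch. It is not Falcon code or a Falcon API: the class names, fields, sample values, and the `unresolved_references` helper are all assumptions made up for illustration. It only shows the idea that feeds refer to clusters by name, and processes refer to feeds by name.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Cluster:
    """Interfaces to one Hadoop cluster (endpoint values are placeholders)."""
    name: str
    interfaces: Dict[str, str]


@dataclass
class Feed:
    """A named dataset and the clusters that hold it."""
    name: str
    kind: str            # "file", "hive-table" or "stream"
    clusters: List[str]  # cluster names, resolved by name


@dataclass
class Process:
    """Processing logic that consumes input feeds and produces output feeds."""
    name: str
    inputs: List[str]    # feed names
    outputs: List[str]   # feed names
    workflow: str        # hypothetical path to the processing logic


def unresolved_references(clusters: List[Cluster], feeds: List[Feed],
                          processes: List[Process]) -> List[Tuple[str, str]]:
    """Return (entity name, missing name) pairs for links that point nowhere."""
    cluster_names = {c.name for c in clusters}
    feed_names = {f.name for f in feeds}
    missing = []
    for f in feeds:
        missing += [(f.name, c) for c in f.clusters if c not in cluster_names]
    for p in processes:
        missing += [(p.name, n) for n in p.inputs + p.outputs if n not in feed_names]
    return missing


# Hypothetical sample pipeline: one cluster, two feeds, one process.
clusters = [Cluster("primary", {"write": "hdfs://primary-nn:8020"})]
feeds = [Feed("raw-logs", "file", ["primary"]),
         Feed("clean-logs", "file", ["primary"])]
processes = [Process("clean", ["raw-logs"], ["clean-logs"], "/apps/clean/workflow.xml")]

print(unresolved_references(clusters, feeds, processes))  # prints []
```

Because every link is a plain name, the same cluster or feed can be referenced from any number of pipelines without being redefined.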

These Falcon entities:

· Are simple to define using XML.

· Are modular: clusters, feeds, and processes are defined separately, then linked together, and are easy to reuse across multiple pipelines.

· Can be configured for replication, late data arrival, archival, and retention.
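As a rough illustration of what such an XML definition can look like, the Python sketch below builds a partial feed entity with the standard library. The element and attribute names (`frequency`, `late-arrival`, `clusters`, `validity`, `retention`, `locations`) follow my recollection of the Falcon feed schema and should be checked against the entity specification for your Falcon version. The function name, feed name, cluster names, path, and dates are hypothetical, and a real feed definition carries further elements (for example ownership and schema information) that are left out here.

```python
import xml.etree.ElementTree as ET


def build_feed_xml(name, path, source_cluster, target_cluster, start, end,
                   frequency="hours(1)", retention="days(90)",
                   late_cutoff="hours(4)"):
    """Build a partial, illustrative feed definition as an XML string."""
    feed = ET.Element("feed", {"name": name, "xmlns": "uri:falcon:feed:0.1"})
    ET.SubElement(feed, "frequency").text = frequency
    # How long to wait for late-arriving data.
    ET.SubElement(feed, "late-arrival", {"cut-off": late_cutoff})

    # Clusters are referenced by name only; they are defined elsewhere.
    # Listing a source and a target cluster is how replication is requested.
    clusters = ET.SubElement(feed, "clusters")
    for cluster_name, role in ((source_cluster, "source"), (target_cluster, "target")):
        cluster = ET.SubElement(clusters, "cluster", {"name": cluster_name, "type": role})
        ET.SubElement(cluster, "validity", {"start": start, "end": end})
        # Retention policy: drop data older than the limit.
        ET.SubElement(cluster, "retention", {"limit": retention, "action": "delete"})

    locations = ET.SubElement(feed, "locations")
    ET.SubElement(locations, "location", {"type": "data", "path": path})
    return ET.tostring(feed, encoding="unicode")


print(build_feed_xml(
    name="raw-logs",
    path="/data/raw-logs/${YEAR}-${MONTH}-${DAY}-${HOUR}",
    source_cluster="primary",
    target_cluster="backup",
    start="2015-09-01T00:00Z",
    end="2016-09-01T00:00Z",
))
```

Note that the feed mentions clusters only by name, which is what lets the same cluster definition be shared by many feeds and processes.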

Using Falcon a complicated data pipeline like below