Apache Falcon uses only three types
of entities to describe all data management policies and pipelines. These
entities are:
- Cluster: Represents the “interfaces” to a Hadoop cluster
- Feed: Defines a “dataset” (a file, Hive table or stream)
- Process: Consumes feeds, invokes processing logic & produces feeds
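For example, a cluster entity is little more than a list of endpoints telling Falcon where the cluster's HDFS, YARN, Oozie and messaging services live. Here is a minimal sketch; the hostnames, ports, versions and staging paths below are placeholders, not values from any real cluster:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Minimal cluster entity: tells Falcon how to reach this cluster's services.
     All endpoints and paths are illustrative placeholders. -->
<cluster name="primaryCluster" description="Primary Hadoop cluster" colo="primary-colo"
         xmlns="uri:falcon:cluster:0.1">
    <interfaces>
        <interface type="readonly"  endpoint="hftp://namenode.example.com:50070" version="2.2.0"/>
        <interface type="write"     endpoint="hdfs://namenode.example.com:8020"  version="2.2.0"/>
        <interface type="execute"   endpoint="resourcemanager.example.com:8050"  version="2.2.0"/>
        <interface type="workflow"  endpoint="http://oozie.example.com:11000/oozie/" version="4.0.0"/>
        <interface type="messaging" endpoint="tcp://broker.example.com:61616?daemon=true" version="5.1.6"/>
    </interfaces>
    <locations>
        <location name="staging" path="/apps/falcon/primaryCluster/staging"/>
        <location name="temp"    path="/tmp"/>
        <location name="working" path="/apps/falcon/primaryCluster/working"/>
    </locations>
</cluster>
```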
Using only these three types of entities, we can manage replication, archival and retention of data, and also handle job/process failures and late data arrival.
These Falcon entities:
- Are easy and simple to define using XML.
- Are modular: clusters, feeds & processes are defined separately, then linked together, and are easy to re-use across multiple pipelines.
- Can be configured for replication, late data arrival, archival and retention (see the feed sketch after this list).
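As an illustration of the last two points, a feed entity ties a dataset to one or more clusters and carries its retention, replication and late-arrival policy in a few lines of XML. This is a minimal sketch; the feed name, dates, paths and the backup cluster name are placeholders:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- An hourly HDFS dataset, replicated from the source cluster to a backup
     cluster, with per-cluster retention. Names, dates and paths are placeholders. -->
<feed name="clicksFeed" description="Hourly click logs" xmlns="uri:falcon:feed:0.1">
    <frequency>hours(1)</frequency>
    <late-arrival cut-off="hours(4)"/>
    <clusters>
        <cluster name="primaryCluster" type="source">
            <validity start="2015-01-01T00:00Z" end="2016-01-01T00:00Z"/>
            <retention limit="days(30)" action="delete"/>
        </cluster>
        <cluster name="backupCluster" type="target">
            <validity start="2015-01-01T00:00Z" end="2016-01-01T00:00Z"/>
            <retention limit="days(90)" action="delete"/>
        </cluster>
    </clusters>
    <locations>
        <location type="data" path="/data/clicks/${YEAR}/${MONTH}/${DAY}/${HOUR}"/>
    </locations>
    <ACL owner="etl-user" group="hadoop" permission="0755"/>
    <schema location="/none" provider="none"/>
</feed>
```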
Using Falcon, a complicated data pipeline like the one below can be simplified to a handful of Falcon entities (which the Falcon engine then converts into multiple Oozie workflows).
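To make the Oozie connection concrete, here is a rough skeleton of a process entity: the workflow element points to an Oozie workflow definition on HDFS, which Falcon wraps with the coordinators that handle scheduling, retries and late data. Every name, date and path below is a placeholder, and the output feed is not defined here:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Bare-bones process: consumes one feed, runs an Oozie workflow, produces another feed.
     Names, dates and the workflow path are illustrative placeholders. -->
<process name="cleanseClicksProcess" xmlns="uri:falcon:process:0.1">
    <clusters>
        <cluster name="primaryCluster">
            <validity start="2015-01-01T00:00Z" end="2016-01-01T00:00Z"/>
        </cluster>
    </clusters>
    <parallel>1</parallel>
    <order>FIFO</order>
    <frequency>hours(1)</frequency>
    <inputs>
        <input name="input" feed="clicksFeed" start="now(0,0)" end="now(0,0)"/>
    </inputs>
    <outputs>
        <output name="output" feed="cleanClicksFeed" instance="now(0,0)"/>
    </outputs>
    <workflow engine="oozie" path="/apps/clickstream/cleanse-workflow"/>
    <retry policy="periodic" delay="minutes(15)" attempts="3"/>
</process>
```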
In my next post I will explain how to define a Falcon process and the prerequisites for doing so.