While working on a project recently, I got the chance to work with Apache Falcon, a data governance engine that defines, schedules, and monitors data management policies. Falcon lets Hadoop administrators centrally define their data pipelines, and then uses those definitions to auto-generate workflows in Apache Oozie. I ran into some issues setting up my first Falcon job, so in my next few posts I will write more about Falcon.
What does Apache Falcon do?
Apache Falcon simplifies complicated data management workflows into generalized entity definitions. Falcon makes it far easier to:
- Define data pipelines
- Monitor data pipelines in coordination with Ambari, and
- Trace pipelines for dependencies, tagging, audits and lineage.
Without this kind of centralized management, some common mistakes creep in: processes might use the wrong copies of data sets, data sets and processes get duplicated, and it becomes increasingly difficult to track down where a particular data set originated.
Falcon addresses these data governance challenges with high-level and reusable “entities” that can be defined once and re-used many times. Data management policies are defined in Falcon entities and manifested as Oozie workflows.
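As a rough illustration, here is a minimal sketch of a Falcon feed entity. All names, paths and dates below are made up for this example, and the two clusters it references ("primaryCluster" and "backupCluster") are assumed to have been submitted to Falcon as cluster entities beforehand. It declares an hourly data set with a retention policy on the source cluster, a target cluster for replication, and business-metadata tags:

<feed name="rawLogsFeed" description="Hourly raw web logs" xmlns="uri:falcon:feed:0.1">
    <!-- Business metadata tags (illustrative key=value pairs) -->
    <tags>owner=analytics,source=webserver</tags>
    <frequency>hours(1)</frequency>
    <timezone>UTC</timezone>
    <clusters>
        <cluster name="primaryCluster" type="source">
            <validity start="2016-01-01T00:00Z" end="2099-12-31T00:00Z"/>
            <!-- Retention: keep 90 days of instances on the source, then delete -->
            <retention limit="days(90)" action="delete"/>
        </cluster>
        <cluster name="backupCluster" type="target">
            <validity start="2016-01-01T00:00Z" end="2099-12-31T00:00Z"/>
            <!-- Replicated copies are kept longer on the DR cluster -->
            <retention limit="months(12)" action="delete"/>
        </cluster>
    </clusters>
    <locations>
        <!-- Falcon substitutes the date/time variables for each feed instance -->
        <location type="data" path="/data/raw/logs/${YEAR}/${MONTH}/${DAY}/${HOUR}"/>
    </locations>
    <ACL owner="falcon" group="hadoop" permission="0755"/>
    <schema location="/none" provider="none"/>
</feed>

An entity like this is typically submitted and scheduled through the Falcon CLI, for example falcon entity -type feed -submit -file rawLogsFeed.xml followed by falcon entity -type feed -schedule -name rawLogsFeed.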
Falcon Features

1. Centrally Manage Data Lifecycle
- Centralized definition & management of pipelines for data ingest, process & export across different clusters.

2. Business Continuity & Disaster Recovery
- Out-of-the-box policies for data replication, retention and archival.
- Configuration and management of late data handling and exception handling.
- Handles process failures and retries (see the process sketch after this list).
- End-to-end monitoring of data pipelines.

3. Address Audit & Compliance Requirements
- Visualize data pipeline lineage.
- Track data pipeline audit logs.
- Tag data with business metadata.