Let's MapReduce with Pentaho Data Integration
I have been exploring Pentaho Data Integration for quite some time, and I always wanted to see how to work with Big Data using Pentaho. Today I got a chance to do some simple MapReduce with Pentaho on the airline dataset which I had loaded into my Cloudera Hadoop cluster (Sample Datasets).
Before doing MapReduce in Pentaho, we need to configure it to work with Cloudera (by default it is configured to work with Apache Hadoop). It can easily be done in 5 minutes by following this link:
After doing all the listed steps, make sure you are able to connect to HDFS through Pentaho. To do this:
1. Create a new transformation.
2. Add a Hadoop File Input step.
3. Browse HDFS: give the address of your namenode, port 8020, and user id hdfs; by default there is no password.
4. If you are able to browse HDFS, then you have successfully configured Pentaho.
5. You can discard this transformation; we were just checking the connectivity.
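As a sanity check outside of Pentaho, you can hit the namenode's WebHDFS REST interface with the same user. This is only a sketch: the host name is a placeholder, and WebHDFS must be enabled on the cluster (on CDH it typically listens on port 50070, not the 8020 RPC port Pentaho uses).

```python
# Hypothetical HDFS connectivity check over WebHDFS (HTTP REST).
# Assumes WebHDFS is enabled; "nn.example.com" is a placeholder host.
import json
from urllib.request import urlopen

def liststatus_url(namenode, path="/", user="hdfs", port=50070):
    # WebHDFS LISTSTATUS: the same directory listing you should
    # see in Pentaho's "Browse HDFS" dialog.
    return (f"http://{namenode}:{port}/webhdfs/v1{path}"
            f"?op=LISTSTATUS&user.name={user}")

def check_hdfs(namenode):
    # Returns the entry names under "/" if the namenode is reachable.
    with urlopen(liststatus_url(namenode)) as resp:
        statuses = json.load(resp)["FileStatuses"]["FileStatus"]
        return [s["pathSuffix"] for s in statuses]
```

If `check_hdfs("nn.example.com")` returns a list of directory names, HDFS is reachable and the Pentaho browse step should work too.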
To do a simple MapReduce in Pentaho we need to create:
1. Mapper transformation
2. Combiner transformation (optional)
3. Reducer transformation
4. MapReduce job
I am going to create a simple MR task which counts the number of flights for each month across the entire airline dataset.
First
create a Mapper transformation:
1. Create a new transformation in Pentaho, let us call it
“airline_mapper”.
2. Add a MapReduce Input step and configure it
as shown:
3. Add a Split Fields step to split the CSV file into separate fields and connect it with the previous step. Configure the current step as shown:
4. Add a String Operations step to pad the month field with a leading 0 (to make all month values 2-digit integers) and connect it with the previous step. It should be configured as shown:
5. Add a User Defined Java Expression step to concatenate year with month and connect it with the previous step. It should be configured as shown:
6. Add a MapReduce Output step and connect it with the previous step. It should be configured as shown:
I have used the Carrier field to count the number of flights for each month; any other field would work just as well.
7. Now save this transformation. It should look like this:
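The mapper steps above can be sketched in plain Python. The column positions are an assumption based on the classic airline on-time dataset layout (Year first, Month second, UniqueCarrier at index 8); adjust them if your file differs.

```python
# Sketch of the mapper logic built from the PDI steps above.
# Assumes the airline on-time CSV layout: Year at index 0,
# Month at index 1, UniqueCarrier at index 8.

def airline_map(line):
    fields = line.split(",")     # Split Fields step
    year = fields[0]
    month = fields[1].zfill(2)   # String Operations: left-pad month to 2 digits
    key = year + month           # User Defined Java Expression: year + month
    value = fields[8]            # Carrier field (we only need something to count)
    return key, value            # MapReduce Output: emit key/value

key, value = airline_map("1987,10,14,3,741,730,912,849,PS,1451")
# key is "198710", value is "PS"
```

Note how `zfill(2)` guarantees that "1" and "10" sort into "01" and "10", so the concatenated year-month keys group and sort correctly in the shuffle.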
Now
create a Reducer transformation:
1. Create a new transformation in Pentaho, let us call it
“airline_reducer”.
2. Add a MapReduce Input step and configure it
as shown:
3. Add a Group By step and connect it with the previous step. It should be configured as shown:
4. Add a MapReduce Output step and connect it with the previous step. It should be configured as shown:
5. Now save this transformation. It should look like this:
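The reducer's Group By step boils down to counting rows per key. A minimal sketch, assuming the reducer receives the (year-month, carrier) pairs emitted by the mapper:

```python
# Sketch of the reducer logic: the Group By step counts the
# values arriving for each year-month key.
from collections import defaultdict

def airline_reduce(pairs):
    counts = defaultdict(int)
    for key, _value in pairs:   # the value itself is ignored; we only count rows
        counts[key] += 1
    return dict(counts)         # MapReduce Output: one (year-month, count) row per key
```

For example, `airline_reduce([("198710", "PS"), ("198710", "PS"), ("198711", "PI")])` yields 2 flights for October 1987 and 1 for November 1987.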
Our mapper and reducer transformations are complete; now let us combine them in a MapReduce job.
Create
a MapReduce job:
1. Create a new job in Pentaho, let us call it
“airline_job”.
2. Add a Start step and a Pentaho MapReduce step and connect them. It should look
like:
3. Configure Pentaho MapReduce as shown in below
screenshots:
You can give any name in Hadoop Job Name; let us call our job airline_agg.
Mapper transformation should point to airline_mapper transformation.
The Mapper Input/Output Step Names should match the step names in our mapper transformation.
4. We don’t have a combiner step, so leave the next tab empty.
5. Configure Reducer tab like this:
Again, the Reducer Input/Output Step Names should match the step names in our reducer transformation.
6. Configure Job Setup step as shown:
7. Configure Cluster tab like this:
8. Save and run this job.
While the job is running we can see its progress in Pentaho and also in the JobTracker UI
(http://jobtracker_ip:50030/).
You can also see the details of a running job by clicking on
it:
On my Cloudera setup it took around 9 minutes to execute this job.
I shall try some more complex analysis on this dataset using Pentaho and post it very soon.