Tuesday, 26 February 2013

Let's MapReduce with Pentaho

Let's MapReduce with Pentaho Data Integration

I have been exploring Pentaho Data Integration for quite some time and always wanted to see how it handles big data. Today I got a chance to run a simple MapReduce job with Pentaho on the Airline dataset, which I had loaded into my Cloudera Hadoop cluster (Sample Datasets).
Before running MapReduce in Pentaho, we need to configure it to work with Cloudera (by default it is configured for Apache Hadoop). This takes about 5 minutes if you follow this link:

Configure Pentaho for Cloudera CDH4

After completing all the listed steps, make sure you can connect to HDFS through Pentaho. To check:

1. Create a new transformation.

2. Add a Hadoop File Input step.

3. Browse HDFS: enter the address of your namenode, 8020 as the port, and hdfs as the user ID. By default there is no password.

4. If you are able to browse HDFS then you have successfully configured Pentaho.

5. You can discard this transformation; it was only for checking connectivity.
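If browsing fails in step 3, it helps to first rule out plain network problems. Below is a minimal sketch, not part of Pentaho, that only checks whether the namenode port accepts TCP connections. The function name `namenode_reachable` and the 5-second timeout are my own choices for illustration; port 8020 is the one used above.

```python
import socket


def namenode_reachable(host, port=8020, timeout=5.0):
    """Return True if a TCP connection to host:port can be opened.

    This only proves the port is open. It says nothing about whether
    Pentaho's Hadoop configuration for Cloudera is correct.
    """
    try:
        # create_connection resolves the host and connects, or raises OSError
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Call it as `namenode_reachable("your-namenode-host")`. A result of `False` points to a firewall, a wrong address, or a stopped namenode rather than to the Pentaho setup.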

To do a simple MapReduce in Pentaho we need to create:

1. Mapper transformation

2. Combiner transformation (optional)

3. Reducer transformation

4. MapReduce job

I am going to create a simple MR task that counts the number of flights in each month across the entire airline dataset.

First create a Mapper transformation:

1. Create a new transformation in Pentaho, let us call it “airline_mapper”.

2. Add a MapReduce Input step and configure it as shown:

3. Add a Split Fields step to split each CSV line into separate fields and connect it to the previous step. Configure it as shown:

4. Add a String Operations step to pad the month field with a leading 0 (so that every month is a 2-digit value) and connect it to the previous step. It should be configured as shown:

5. Add a User Defined Java Expression step to concatenate the year with the month and connect it to the previous step. It should be configured as shown:

6. Add a MapReduce Output step and connect it to the previous step. It should be configured as shown:

I have used the Carrier field as the value to count flights for each month; any other field would work just as well.

7. Now save this transformation. It should look like this:
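For readers who think better in code, here is a rough Python sketch of what the mapper transformation does to a single input line. It is not what Pentaho generates, only an illustration of the logic. I am assuming the usual airline on-time column order, with Year first, Month second and the carrier code in the ninth column; adjust the indices if your file differs. The function name `map_record` and the skipping of header or short rows are my additions.

```python
def map_record(line):
    """Mimic the mapper transformation for one CSV line.

    Returns a (key, value) pair, or None for header or malformed rows.
    """
    # Split Fields step: break the CSV line into columns
    fields = line.rstrip("\n").split(",")

    # ASSUMPTION: Year at index 0, Month at index 1, carrier at index 8.
    # Skipping rows that do not start with a numeric year is my own guard.
    if len(fields) < 9 or not fields[0].isdigit():
        return None
    year, month, carrier = fields[0], fields[1], fields[8]

    # String Operations step: left-pad the month to 2 digits
    month = month.zfill(2)

    # User Defined Java Expression step: concatenate year and month
    key = year + month

    # MapReduce Output step: emit key and value
    return key, carrier
```

A January 2008 row therefore gets the key `200801`, which keeps the months in the right order when the keys are sorted as strings.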

 

Now create a Reducer transformation:

1. Create a new transformation in Pentaho, let us call it “airline_reducer”.

2. Add a MapReduce Input step and configure it as shown:

3. Add a Group By step and connect it to the previous step. It should be configured as shown:

4. Add a MapReduce Output step and connect it to the previous step. It should be configured as shown:

5. Now save this transformation. It should look like this:
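The reducer is even simpler when written out as code. The sketch below assumes the Group By step groups on the key and counts the values per key, which is what a flight count needs. The function name `reduce_counts` is hypothetical, and the input is assumed to arrive sorted by key, as Hadoop delivers it to a reducer.

```python
from itertools import groupby


def reduce_counts(sorted_pairs):
    """Yield (key, count) for (key, value) pairs already sorted by key."""
    # groupby only merges adjacent equal keys, hence the sorted-input requirement
    for key, group in groupby(sorted_pairs, key=lambda kv: kv[0]):
        # Count the values for this key; the values themselves are ignored
        yield key, sum(1 for _ in group)
```

This also shows why the choice of value field in the mapper does not matter: the reducer only counts rows per key and never looks at the value.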

Our mapper and reducer transformations are complete; now let us combine them in an MR job.

Create a MapReduce job:

1. Create a new job in Pentaho, let us call it “airline_job”.

2. Add a Start step and a Pentaho MapReduce step and connect them. It should look like this:

3. Configure the Pentaho MapReduce step as shown in the screenshots below:

You can enter any name in Hadoop Job Name; let us call our job airline_agg.
Mapper Transformation should point to the airline_mapper transformation.
Mapper Input Step Name and Mapper Output Step Name should match the step names in our mapper transformation.

4. We don’t have a combiner transformation, so leave the next tab empty.

5. Configure Reducer tab like this:

Again, the Reducer Input Step Name and Reducer Output Step Name should match the step names in our reducer transformation.

6. Configure the Job Setup tab as shown:

7. Configure the Cluster tab like this:

8. Save and run this job.
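To picture what the job does between the two transformations, here is a small in-memory simulation of the map, shuffle and reduce phases. It is only a sketch for reasoning about the data flow and does not involve Hadoop or Pentaho. All names in it (`run_local_job`, `flights_mapper`, `count_reducer`) are hypothetical, and the sample rows use a simplified three-column layout of year, month and carrier rather than the real airline file.

```python
from collections import defaultdict


def run_local_job(lines, mapper, reducer):
    """Run map -> shuffle -> reduce in memory and return {key: result}."""
    shuffled = defaultdict(list)
    for line in lines:
        # Map phase: each line may emit zero or more (key, value) pairs
        for key, value in mapper(line):
            # Shuffle phase: collect all values that share a key
            shuffled[key].append(value)
    # Reduce phase: keys are processed in sorted order, as in Hadoop
    return {key: reducer(key, values) for key, values in sorted(shuffled.items())}


def flights_mapper(line):
    # Simplified, made-up input layout: year,month,carrier
    year, month, carrier = line.split(",")
    yield year + month.zfill(2), carrier


def count_reducer(key, values):
    # One value per flight, so the count is the number of values
    return len(values)


sample = ["2008,1,WN", "2008,1,AA", "2008,2,WN", "2007,12,DL"]
result = run_local_job(sample, flights_mapper, count_reducer)
print(result)
```

On the cluster, the grouping and sorting in the middle is done by Hadoop itself, which is why the job only needs to be told which transformation is the mapper and which is the reducer.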

While the job is running, we can follow its progress in Pentaho and in the JobTracker UI (http://jobtracker_ip:50030/).

You can also see the details of a running job by clicking on it:

On my Cloudera setup, this job took around 9 minutes to run.

I shall try some more complex analysis on this dataset using Pentaho and post it soon.