Sunday, 24 February 2013

Sample Datasets

Last week I set up my Cloudera cluster; now I want to see what Hadoop and its related tools can do. For that I need some sample datasets to play with, but where can I get them?
So I started googling around and found that there are some really good and HUGE sample datasets available. Here are a few that I like:
  1. Airline on-time performance
  2. Yahoo Webscope
  3. FreeBase
Personally, I like the first one. It is hassle-free to download and consists of flight arrival and departure details for all commercial flights within the USA from October 1987 to April 2008. This is a large dataset: there are nearly 120 million records in total, taking up 1.6 gigabytes compressed and 12 gigabytes uncompressed.
To download most of the datasets from Yahoo Webscope you need to accept a license that limits use to educational or research purposes, and you need a recommendation from your college or university to get access.
You can also explore the third one to find a dataset in your area of interest.
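If you want to script the download of the airline files, a minimal sketch could look like the one below. The base URL is an assumption (the ASA Data Expo 2009 site hosted the files at one point and may have moved since), so the loop only prints the URLs; swap the echo for a wget to actually fetch them.

```shell
# Sketch of a download loop -- the base URL is an assumption and may
# need adjusting to wherever the Data Expo files are currently hosted.
base_url="http://stat-computing.org/dataexpo/2009"
for year in $(seq 1987 2008); do
    # Replace echo with: wget "${base_url}/${year}.csv.bz2"
    echo "${base_url}/${year}.csv.bz2"
done
```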

My Dataset

I downloaded all the files from Airline on-time performance to my CentOS machine, unzipped them, removed the header line from each, and loaded them into HDFS. The commands I executed are below:

hadoop fs -mkdir /airline          # create the target directory in HDFS
cd path-to-airline-data
for file_name in *.bz2; do
    bunzip2 "$file_name"                             # e.g. 1987.csv.bz2 -> 1987.csv
    base_name=$(echo "$file_name" | cut -d'.' -f1,2)
    awk 'NR != 1' "$base_name" > "nh_${base_name}"   # drop the header line
    hadoop fs -put "nh_${base_name}" /airline/       # load into HDFS
    echo "$base_name"                                # progress indicator
done
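To sanity-check the header-stripping step before running it on 12 GB of data, you can try the same awk on a tiny made-up file (the file name and contents here are purely illustrative):

```shell
# Make a tiny CSV with a header plus two data rows (made-up sample)
printf 'Year,Month,DayofMonth\n1987,10,14\n1987,10,15\n' > sample.csv

# Same header-dropping awk as in the loop above
awk 'NR != 1' sample.csv > nh_sample.csv

cat nh_sample.csv   # prints only the two data rows, no header
```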

You can explore the data in HDFS using
hadoop fs -ls /airline/
or you can use the web UI to explore it:
http://namenode:50070/
In my next blog post I am going to run some MapReduce jobs on this data using Pentaho Data Integration.
