Data Lab: Introduction To Hive's Partitioning

Monday, 14 October 2013

Introduction To Hive's Partitioning

A simple query in Hive reads the entire dataset even if we have where clause filter. This becomes a bottleneck for running MapReduce jobs over a large table. We can over come this issue by implementing partitions in Hive. Hive makes it very easy to implement partition by using automatic partition scheme when the table is created. In Hive’s implementation of partitioning, data within a table is split across multiple partitions. Each partition corresponds to a particular value(s) of partition column(s) and is stored as a sub-directory within the table’s directory on HDFS. When the table is queried, where applicable, only the required partitions of the table are queried, thereby reducing the I/O and time required by the query. Today we are going to see how we can load a csv file to a partitioned table. For this we are going to use Airline OnTime dataset. Loading csv data to a partitioned table involves below mentioned two steps:

Load csv file to a non-partitioned table.
Load non-partitioned table data to partitioned table.

We shall partition Airline OnTime data based on two columns - year and month. 1. Load csv file to a non-partitioned table. We shall create a staging table to hold data from csv file. The hive commands to create schema and table are given below:

create schema stg_airline;
use stg_airline;

create table stg_airline.onTimePerf
(Year INT ,
Month INT ,
DayofMonth INT ,
DayOfWeek INT ,
DepTime INT ,
CRSDepTime INT ,
ArrTime INT ,
CRSArrTime INT ,
UniqueCarrier STRING ,
FlightNum INT ,
TailNum STRING ,
ActualElapsedTime INT ,
CRSElapsedTime INT ,
AirTime STRING ,
ArrDelay INT ,
DepDelay INT ,
Origin STRING ,
Dest STRING ,
Distance INT ,
TaxiIn STRING ,
TaxiOut STRING ,
Cancelled INT ,
CancellationCode STRING ,
Diverted INT ,
CarrierDelay STRING ,
WeatherDelay STRING ,
NASDelay STRING ,
SecurityDelay STRING ,
LateAircraftDelay STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' 
STORED AS TEXTFILE;

After creating the table load the csv data (note - delete header from csv) to table using below hive command:

LOAD DATA LOCAL INPATH "1987.csv" OVERWRITE INTO TABLE stg_airline.onTimePerf;

2. Load non-partitioned table data to partitioned table. We shall now create a table partitioned by year and month columns, the commands for this are given below:

create schema airline;
use airline;

create table airline.onTimePerf
(DayofMonth INT ,
DayOfWeek INT ,
DepTime INT ,
CRSDepTime INT ,
ArrTime INT ,
CRSArrTime INT ,
UniqueCarrier STRING ,
FlightNum INT ,
TailNum STRING ,
ActualElapsedTime INT ,
CRSElapsedTime INT ,
AirTime STRING ,
ArrDelay INT ,
DepDelay INT ,
Origin STRING ,
Dest STRING ,
Distance INT ,
TaxiIn STRING ,
TaxiOut STRING ,
Cancelled INT ,
CancellationCode STRING ,
Diverted INT ,
CarrierDelay STRING ,
WeatherDelay STRING ,
NASDelay STRING ,
SecurityDelay STRING ,
LateAircraftDelay STRING)
PARTITIONED BY (Year INT, Month INT )
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' 
STORED AS TEXTFILE;

To load partitioned table we use below command:

SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
INSERT OVERWRITE TABLE airline.onTimePerf PARTITION(Year, Month) SELECT DayofMonth, DayOfWeek, DepTime, CRSDepTime, ArrTime, CRSArrTime, UniqueCarrier, FlightNum, TailNum, ActualElapsedTime, CRSElapsedTime, AirTime, ArrDelay, DepDelay, Origin, Dest, Distance, TaxiIn, TaxiOut, Cancelled, CancellationCode, Diverted, CarrierDelay, WeatherDelay, NASDelay, SecurityDelay, LateAircraftDelay, Year, Month FROM stg_airline.onTimePerf;

While writing insert statement for a partitioned table make sure that you specify the partition columns at the last in select clause. The 2 SET commands instruct hive to change our query to dynamically load partitions. If you don't execute the above 2 SET commands you will get below error:

FAILED: SemanticException [Error 10096]: Dynamic partition strict mode requires at least one static partition column. To turn this off set hive.exec.dynamic.partition.mode=nonstrict

In my next blog I shall describe about using HCatalog in MapReduce program.

20 comments:

Unknown13 February 2015 at 04:55
This is very nice post, thank you.
can you differentiate how bucketing and partition works?
ReplyDelete
Replies
Unknown26 July 2015 at 01:44
its very nice explanaion and very helful can u explain about view and indexs with commands pls
ReplyDelete
Replies
Anonymous7 December 2015 at 22:07
How to rename table names based on partitions
for ex

2013 data in the table 2013_airline.onTimePerf
2014 data in the table 2014_airline.onTimePerf
2015 data in the table 2014_airline.onTimePerf
ReplyDelete
Replies
Unknown1 January 2016 at 22:39
Really nice
ReplyDelete
Replies
Anonymous20 April 2016 at 21:49
HI , i wanna do partitions by date , how can be alter table formate ?
ReplyDelete
Replies
Unknown26 August 2016 at 04:21
Realy good blog for hive partitioning..here also see our blog for hive partitioning and bucketing..http://www.geoinsyssoft.com/hive-partition-bucketing/
ReplyDelete
Replies
Unknown31 October 2016 at 05:05
where is data ?can u please forward the data.
ReplyDelete
Replies
Unknown7 January 2017 at 06:20
INSERT OVERWRITE TABLE airline.onTimePerf PARTITION(Year, Month) SELECT DayofMonth, DayOfWeek, DepTime, CRSDepTime, ArrTime, CRSArrTime, UniqueCarrier, FlightNum, TailNum, ActualElapsedTime, CRSElapsedTime, AirTime, ArrDelay, DepDelay, Origin, Dest, Distance, TaxiIn, TaxiOut, Cancelled, CancellationCode, Diverted, CarrierDelay, WeatherDelay, NASDelay, SecurityDelay, LateAircraftDelay, Year, Month FROM stg_airline.onTimePerf;
this is not working for me.
I had to give year='1987'. because i have so many years data from 1987-2008. also i am not suppose to mention year in select query.
ReplyDelete
Replies
Tejuteju9 July 2018 at 23:24
Big data in hadoop is the interesting topic and to get some important information. Big data hadoop online Course Bangalore
ReplyDelete
Replies
Venu16 September 2018 at 23:23
very useful info, but in that airline data forgot to skip header i recommend pls skip header

venu
spark online trainer
ReplyDelete
Replies
Padminiprwatech27 February 2020 at 21:24
You are doing a great job by sharing useful information about Hadoop course. It is one of the post to read and improve my knowledge in Hadoop.You can check our Hadoop Hive Partition example,for more information about Hadoop static partition in hive.
ReplyDelete
Replies
pragyan29 November 2021 at 00:57
az 104 exam questions
scrum master exam questions
dp 900 exam questions
da 100 exam questions
ReplyDelete
Replies
Anonymous1 May 2022 at 22:53
mmorpg oyunlar
instagram takipçi satın al
Tiktok jeton hilesi
TİKTOK JETON HİLESİ
saç ekimi antalya
Instagram takipçi satin al
İnstagram takipçi satın al
metin2 pvp serverlar
instagram takipçi satın al
ReplyDelete
Replies
Anonymous18 May 2022 at 00:27
PERDE MODELLERİ
numara onay
mobil ödeme bozdurma
nft nasıl alınır
ankara evden eve nakliyat
TRAFİK SİGORTASİ
dedektör
web sitesi kurma
aşk kitapları
ReplyDelete
Replies
wpnoob26 June 2024 at 10:02
mt5 download apk
ReplyDelete
Replies
pragyan30 June 2024 at 06:14
forex trading kya hai
ReplyDelete
Replies
Tecblogger23 December 2024 at 21:06
Watch horror movies online like never before with GeoFlixzPlus! Dive into a spine-chilling collection of terrifying classics and modern-day nightmares, all available to stream anytime, anywhere. Get ready for an edge-of-your-seat experience that will leave you craving more.
ReplyDelete
Replies

Add comment