Introduction To Hive's Partitioning
Introduction To Hive's Partitioning
A simple query in Hive reads the entire dataset even if we have where clause filter. This becomes a bottleneck for running MapReduce jobs over a large table. We can over come this issue by implementing partitions in Hive. Hive makes it very easy to implement partition by using automatic partition scheme when the table is created.
In Hive’s implementation of partitioning, data within a table is split across multiple partitions. Each partition corresponds to a particular value(s) of partition column(s) and is stored as a sub-directory within the table’s directory on HDFS. When the table is queried, where applicable, only the required partitions of the table are queried, thereby reducing the I/O and time required by the query.
Today we are going to see how we can load a csv file to a partitioned table. For this we are going to use
Airline OnTime dataset. Loading csv data to a partitioned table involves below mentioned two steps:
- Load csv file to a non-partitioned table.
- Load non-partitioned table data to partitioned table.
We shall partition Airline OnTime data based on two columns - year and month.
1. Load csv file to a non-partitioned table.
We shall create a staging table to hold data from csv file. The hive commands to create schema and table are given below:
create schema stg_airline;
use stg_airline;
create table stg_airline.onTimePerf
(Year INT ,
Month INT ,
DayofMonth INT ,
DayOfWeek INT ,
DepTime INT ,
CRSDepTime INT ,
ArrTime INT ,
CRSArrTime INT ,
UniqueCarrier STRING ,
FlightNum INT ,
TailNum STRING ,
ActualElapsedTime INT ,
CRSElapsedTime INT ,
AirTime STRING ,
ArrDelay INT ,
DepDelay INT ,
Origin STRING ,
Dest STRING ,
Distance INT ,
TaxiIn STRING ,
TaxiOut STRING ,
Cancelled INT ,
CancellationCode STRING ,
Diverted INT ,
CarrierDelay STRING ,
WeatherDelay STRING ,
NASDelay STRING ,
SecurityDelay STRING ,
LateAircraftDelay STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
After creating the table load the csv data (note - delete header from csv) to table using below hive command:
LOAD DATA LOCAL INPATH "1987.csv" OVERWRITE INTO TABLE stg_airline.onTimePerf;
2. Load non-partitioned table data to partitioned table.
We shall now create a table partitioned by year and month columns, the commands for this are given below:
create schema airline;
use airline;
create table airline.onTimePerf
(DayofMonth INT ,
DayOfWeek INT ,
DepTime INT ,
CRSDepTime INT ,
ArrTime INT ,
CRSArrTime INT ,
UniqueCarrier STRING ,
FlightNum INT ,
TailNum STRING ,
ActualElapsedTime INT ,
CRSElapsedTime INT ,
AirTime STRING ,
ArrDelay INT ,
DepDelay INT ,
Origin STRING ,
Dest STRING ,
Distance INT ,
TaxiIn STRING ,
TaxiOut STRING ,
Cancelled INT ,
CancellationCode STRING ,
Diverted INT ,
CarrierDelay STRING ,
WeatherDelay STRING ,
NASDelay STRING ,
SecurityDelay STRING ,
LateAircraftDelay STRING)
PARTITIONED BY (Year INT, Month INT )
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
To load partitioned table we use below command:
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
INSERT OVERWRITE TABLE airline.onTimePerf PARTITION(Year, Month) SELECT DayofMonth, DayOfWeek, DepTime, CRSDepTime, ArrTime, CRSArrTime, UniqueCarrier, FlightNum, TailNum, ActualElapsedTime, CRSElapsedTime, AirTime, ArrDelay, DepDelay, Origin, Dest, Distance, TaxiIn, TaxiOut, Cancelled, CancellationCode, Diverted, CarrierDelay, WeatherDelay, NASDelay, SecurityDelay, LateAircraftDelay, Year, Month FROM stg_airline.onTimePerf;
While writing insert statement for a partitioned table make sure that you specify the partition columns at the last in select clause.
The 2 SET commands instruct hive to change our query to dynamically load partitions.
If you don't execute the above 2 SET commands you will get below error:
FAILED: SemanticException [Error 10096]: Dynamic partition strict mode requires at least one static partition column. To turn this off set hive.exec.dynamic.partition.mode=nonstrict
In my next blog I shall describe about using HCatalog in MapReduce program.
This is very nice post, thank you.
ReplyDeletecan you differentiate how bucketing and partition works?
Hive partitioning and bucketing are very nicely explained here http://www.bidn.com/blogs/cprice1979/ssas/4646/partitions-amp-buckets-in-hive
DeleteThanks Rishav
Deleteits very nice explanaion and very helful can u explain about view and indexs with commands pls
ReplyDeleteHow to rename table names based on partitions
ReplyDeletefor ex
2013 data in the table 2013_airline.onTimePerf
2014 data in the table 2014_airline.onTimePerf
2015 data in the table 2014_airline.onTimePerf
I am not sure if that can be achieved directly using Hive queries, but you can easily write some shell script which can achieve this.
DeleteYou can also look into multi table insert for achieving this.
Really nice
ReplyDeleteHI , i wanna do partitions by date , how can be alter table formate ?
ReplyDeleteRealy good blog for hive partitioning..here also see our blog for hive partitioning and bucketing..http://www.geoinsyssoft.com/hive-partition-bucketing/
ReplyDeletewhere is data ?can u please forward the data.
ReplyDeleteINSERT OVERWRITE TABLE airline.onTimePerf PARTITION(Year, Month) SELECT DayofMonth, DayOfWeek, DepTime, CRSDepTime, ArrTime, CRSArrTime, UniqueCarrier, FlightNum, TailNum, ActualElapsedTime, CRSElapsedTime, AirTime, ArrDelay, DepDelay, Origin, Dest, Distance, TaxiIn, TaxiOut, Cancelled, CancellationCode, Diverted, CarrierDelay, WeatherDelay, NASDelay, SecurityDelay, LateAircraftDelay, Year, Month FROM stg_airline.onTimePerf;
ReplyDeletethis is not working for me.
I had to give year='1987'. because i have so many years data from 1987-2008. also i am not suppose to mention year in select query.
Big data in hadoop is the interesting topic and to get some important information. Big data hadoop online Course Bangalore
ReplyDeletevery useful info, but in that airline data forgot to skip header i recommend pls skip header
ReplyDeletevenu
spark online trainer
You are doing a great job by sharing useful information about Hadoop course. It is one of the post to read and improve my knowledge in Hadoop.You can check our Hadoop Hive Partition example,for more information about Hadoop static partition in hive.
ReplyDeleteaz 104 exam questions
ReplyDeletescrum master exam questions
dp 900 exam questions
da 100 exam questions
mmorpg oyunlar
ReplyDeleteinstagram takipçi satın al
Tiktok jeton hilesi
TİKTOK JETON HİLESİ
saç ekimi antalya
Instagram takipçi satin al
İnstagram takipçi satın al
metin2 pvp serverlar
instagram takipçi satın al
PERDE MODELLERİ
ReplyDeletenumara onay
mobil ödeme bozdurma
nft nasıl alınır
ankara evden eve nakliyat
TRAFİK SİGORTASİ
dedektör
web sitesi kurma
aşk kitapları
mt5 download apk
ReplyDeleteforex trading kya hai
ReplyDelete