Tuesday 28 July 2015

Hive - Initialization Script

Many Hive scripts rely on the same Hive UDFs and Hive configuration settings. Declaring these UDFs and configuration variables in every script, or at the start of every Hive session, is cumbersome, and keeping multiple scripts in sync becomes harder over time.

To overcome these problems we can follow either of the methods below -
  1. Create an environment file which is executed at the start of each Hive CLI session.
  2. Create an initialization script file which is run before the start of a Hive CLI session or before the execution of another Hive script.

Let us see each of the above methods in more detail -
1. Create an environment file -
Hive executes an environment file named .hiverc, located in the user's home directory. All commands in this file are executed at the start of each Hive CLI session, including runs with the hive -e or hive -f options.
By default the .hiverc file is not present in the user's home directory, so the user needs to create it.
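
As a minimal sketch, the file can be created from the shell as shown below. The two settings are only illustrative (they are taken from the example script later in this post), and the existence check is an added precaution so that an existing .hiverc is not overwritten.

```shell
# Path of the per-user Hive environment file.
HIVERC="$HOME/.hiverc"

# Create the file only if it is missing, so existing settings are preserved.
if [ ! -f "$HIVERC" ]; then
  cat > "$HIVERC" <<'EOF'
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
EOF
fi
```

The commands in this file should then be picked up the next time the Hive CLI is started.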

2. Create an initialization script file -
This option is more useful when we want to execute different sets of initialization commands for different sets of scripts.
To use this option, create a text file with all the initialization commands and pass it to Hive with the hive -i <filename> option. This option can be used with the Hive or Beeline CLI, and with the -e or -f non-interactive modes.

An example initialization script looks like this (let's assume the filename is hive_init.hql) -

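-- Allow dynamic partition inserts without a static partition column
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Disable vectorized query execution
SET hive.vectorized.execution.enabled=false;
SET hive.vectorized.execution.reduce.enabled=false;

-- Enforce bucketing when inserting into bucketed tables
SET hive.enforce.bucketing=true;

-- Run independent stages of a query in parallel
SET hive.exec.parallel=true;

-- Disable automatic map join conversion and use bucket map joins instead
SET hive.auto.convert.join=false;
SET hive.enforce.bucketmapjoin=true;
SET hive.optimize.bucketmapjoin.sortedmerge=true;
SET hive.optimize.bucketmapjoin=true;

-- Register the UDF jar and declare the temporary functions
ADD JAR /hadoop/rishav/udfs-0.0.1-SNAPSHOT.jar;
CREATE TEMPORARY FUNCTION getEvent AS 'com.rishav.bigdata.hive.udfs.GetEventUDF';
CREATE TEMPORARY FUNCTION get_last_day AS 'com.rishav.bigdata.hive.udfs.LastDayUDF';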

To use this script in the different modes -
Hive CLI - hive -i hive_init.hql
Hive non-interactive command mode - hive -i hive_init.hql -e "sql command;"
Hive non-interactive script mode - hive -i hive_init.hql -f <script_file>
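
When different groups of scripts need different initialization files, the script mode above can be wrapped in a small shell function. This is only a sketch: the function name run_with_init and the HIVE_BIN variable are hypothetical helpers introduced here for illustration and are not part of Hive.

```shell
# Hypothetical helper: run a Hive script with a chosen initialization file.
# HIVE_BIN is an assumed override for the hive executable (defaults to "hive").
run_with_init() {
  init_file="$1"
  script_file="$2"

  # Fail early if either file is missing, instead of letting Hive start.
  if [ ! -f "$init_file" ]; then
    echo "missing init file: $init_file" >&2
    return 1
  fi
  if [ ! -f "$script_file" ]; then
    echo "missing script file: $script_file" >&2
    return 1
  fi

  # Same as: hive -i <init_file> -f <script_file>
  "${HIVE_BIN:-hive}" -i "$init_file" -f "$script_file"
}
```

A call such as run_with_init hive_init.hql <script_file> would then run the script with the settings and UDFs from hive_init.hql already in place.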