To convert CSV data to Avro data using Hive, follow the steps below:
- Create a Hive table stored as a text file, specifying your CSV delimiter.
- Load the CSV file into that table using the "load data" command.
- Create another Hive table using AvroSerDe.
- Insert the data from the first table into the new Avro table using the "insert overwrite" command.
To demonstrate this, I will use the data below (student.csv):
0,38,91
0,65,28
0,78,16
1,34,96
1,78,14
1,11,43
Now execute the queries below in Hive:
--1. Create a Hive table stored as textfile
USE test;
CREATE TABLE csv_table (
  student_id INT,
  subject_id INT,
  marks INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
--2. Load csv_table with student.csv data
LOAD DATA LOCAL INPATH "/path/to/student.csv" OVERWRITE INTO TABLE test.csv_table;
--3. Create another Hive table using AvroSerDe
CREATE TABLE avro_table
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES (
  'avro.schema.literal'='{
    "namespace": "com.rishav.avro",
    "name": "student_marks",
    "type": "record",
    "fields": [ { "name":"student_id","type":"int"}, { "name":"subject_id","type":"int"}, { "name":"marks","type":"int"}]
  }');
--4. Load avro_table with data from csv_table
INSERT OVERWRITE TABLE avro_table SELECT student_id, subject_id, marks FROM csv_table;
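The data is now available in Avro format in the Hive warehouse folder. To dump it to the local file system, use the command below. This assumes the table directory contains a single data file, since several Avro container files concatenated together do not form a valid Avro file: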
hadoop fs -cat /path/to/warehouse/test.db/avro_table/* > student.avro
To get JSON data from this Avro file, use the avro-tools tojson command:
java -jar avro-tools-1.7.5.jar tojson student.avro > student.json
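As a quick sanity check on the conversion, the row count of the original CSV can be compared with the record count of the JSON output. The function below is only a sketch: check_row_counts is a hypothetical helper name, not part of Hive or avro-tools, and it assumes that student.json holds one JSON record per line (the default tojson output, without pretty-printing) and that student.csv has no header row.

```shell
# Hypothetical helper: compare the row count of a CSV file with the
# record count of a JSON-lines file. Assumes one record per line in both.
check_row_counts() {
  csv_file="$1"
  json_file="$2"
  # grep -c '' counts every line, including a last line without a newline
  csv_rows=$(grep -c '' "$csv_file" || true)
  json_rows=$(grep -c '' "$json_file" || true)
  if [ "$csv_rows" -eq "$json_rows" ]; then
    echo "OK: $csv_rows rows in both files"
  else
    echo "MISMATCH: $csv_rows CSV rows, $json_rows JSON records" >&2
    return 1
  fi
}
```

After defining the function, it could be called as check_row_counts student.csv student.json. It returns a non-zero status when the counts differ, so it can also be used in a script.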
So we can easily convert CSV to Avro, and CSV to JSON as well, by writing just four HiveQL queries.
Nice post Rohit. With Avro, can you change the schema after the table is created? Let's say I get a new column in the existing data source; can I update the Avro schema?
Not very sure if we can alter avro_table. You can try this out.