Tuesday, 18 February 2014

Getting started with Avro Part 1

In our previous post we got a basic idea of Avro. In this post we will use Avro to serialize and deserialize data.

We will cover three ways of using Avro for serialization/deserialization:
  1. Using Avro command line tools.
  2. Using Avro Java API without code generation.
  3. Using Avro Java API with code generation.

Sample Data

We will use the sample data below (StudentActivity.json):
{"id":"A91D021BA58444B29D4D42CA5E39F7BF","student_id":100,"university_id":908,"course_details":{"course_id":100,"enroll_date":"2012-02-13 00:00:00.000000000","verb":"completed","result_score":0.9}}
{"id":"502A77CC99B241CB94CA356F5218F1A9","student_id":101,"university_id":112,"course_details":{"course_id":233,"enroll_date":"2011-06-08 00:00:00.000000000","verb":"started","result_score":0.65}}
{"id":"5D04CD5ABF014D6EBA237766F9B470DE","student_id":102,"university_id":340,"course_details":{"course_id":339,"enroll_date":"2012-03-06 00:00:00.000000000","verb":"started","result_score":0.57}}
Note that the JSON records are nested: each record holds a course_details object with its own fields.
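To make the nesting concrete, here is a small Python sketch that parses one line of the sample data and reaches into the nested object. Python is used only for illustration and is not part of the Avro tooling; the helper name load_records is made up for this example.

```python
import json

# First line of StudentActivity.json, inlined so the sketch is self-contained.
SAMPLE = (
    '{"id":"A91D021BA58444B29D4D42CA5E39F7BF","student_id":100,'
    '"university_id":908,"course_details":{"course_id":100,'
    '"enroll_date":"2012-02-13 00:00:00.000000000","verb":"completed",'
    '"result_score":0.9}}'
)

def load_records(text):
    """Parse newline-delimited JSON: one record per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]

records = load_records(SAMPLE)
first = records[0]

# Top-level fields are plain values; course_details is itself an object.
print(first["student_id"], first["course_details"]["verb"])  # prints: 100 completed
```

For the real file, the same helper could be fed the contents of StudentActivity.json instead of the inlined string.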

Defining a schema

Avro schemas are defined using JSON. The Avro schema for our sample data is defined below (StudentActivity.avsc):
{
  "namespace": "com.rishav.avro",
  "type": "record",
  "name": "StudentActivity",
  "fields": [
    {
      "name": "id",
      "type": "string"
    },
    {
      "name": "student_id",
      "type": "int"
    },
    {
      "name": "university_id",
      "type": "int"
    },
    {
      "name": "course_details",
      "type": {
        "name": "Activity",
        "type": "record",
        "fields": [
          {
            "name": "course_id",
            "type": "int"
          },
          {
            "name": "enroll_date",
            "type": "string"
          },
          {
            "name": "verb",
            "type": "string"
          },
          {
            "name": "result_score",
            "type": "double"
          }
        ]
      }
    }
  ]
}
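Before serializing, it can help to check that a record has the shape the schema describes. The sketch below is a hand-rolled check in Python, not the Avro library's own validation: it only understands records and the three primitive types used in this schema (string, int, double), and the names CHECKS and conforms are invented for this example.

```python
# The schema from StudentActivity.avsc, written as Python dicts.
ACTIVITY = {"name": "Activity", "type": "record", "fields": [
    {"name": "course_id", "type": "int"},
    {"name": "enroll_date", "type": "string"},
    {"name": "verb", "type": "string"},
    {"name": "result_score", "type": "double"},
]}
SCHEMA = {"namespace": "com.rishav.avro", "type": "record",
          "name": "StudentActivity", "fields": [
    {"name": "id", "type": "string"},
    {"name": "student_id", "type": "int"},
    {"name": "university_id", "type": "int"},
    {"name": "course_details", "type": ACTIVITY},
]}

# Only the primitives this schema uses; an Avro int is a 32-bit signed value.
CHECKS = {
    "string": lambda v: isinstance(v, str),
    "int": lambda v: isinstance(v, int) and not isinstance(v, bool)
                     and -2**31 <= v < 2**31,
    "double": lambda v: isinstance(v, (int, float)) and not isinstance(v, bool),
}

def conforms(value, schema):
    """Return True if value has the shape described by schema."""
    if isinstance(schema, str):              # primitive type name
        return CHECKS[schema](value)
    if schema["type"] != "record":           # e.g. {"type": "string"}
        return conforms(value, schema["type"])
    if not isinstance(value, dict):
        return False
    fields = schema["fields"]
    if set(value) != {f["name"] for f in fields}:
        return False                         # missing or unexpected field
    return all(conforms(value[f["name"]], f["type"]) for f in fields)

record = {"id": "A91D021BA58444B29D4D42CA5E39F7BF", "student_id": 100,
          "university_id": 908,
          "course_details": {"course_id": 100,
                             "enroll_date": "2012-02-13 00:00:00.000000000",
                             "verb": "completed", "result_score": 0.9}}

print(conforms(record, SCHEMA))                            # prints: True
print(conforms(dict(record, student_id="100"), SCHEMA))    # prints: False
```

The second call fails because student_id is given as a string while the schema declares it as an int.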


1. Serialization/Deserialization using Avro command line tools

Avro provides a jar file named avro-tools-<version>.jar, which bundles many command line tools. Running it without arguments lists them:

$ java -jar avro-tools-1.7.5.jar
Version 1.7.5 of Apache Avro
Copyright 2010 The Apache Software Foundation
This product includes software developed at
The Apache Software Foundation (http://www.apache.org/).
C JSON parsing provided by Jansson and
written by Petri Lehtinen. The original software is
available from http://www.digip.org/jansson/.
----------------
Available tools:
cat            extracts samples from files
compile        Generates Java code for the given schema.
concat         Concatenates avro files without re-compressing.
fragtojson     Renders a binary-encoded Avro datum as JSON.
fromjson       Reads JSON records and writes an Avro data file.
fromtext       Imports a text file into an avro data file.
getmeta        Prints out the metadata of an Avro data file.
getschema      Prints out schema of an Avro data file.
idl            Generates a JSON schema from an Avro IDL file
idl2schemata   Extract JSON schemata of the types from an Avro IDL file
induce         Induce schema/protocol from Java class/interface via reflection.
jsontofrag     Renders a JSON-encoded Avro datum as binary.
random         Creates a file with randomly generated instances of a schema.
recodec        Alters the codec of a data file.
rpcprotocol    Output the protocol of a RPC service
rpcreceive     Opens an RPC Server and listens for one message.
rpcsend        Sends a single RPC message.
tether         Run a tethered mapreduce job.
tojson         Dumps an Avro data file as JSON, one record per line.
totext         Converts an Avro data file to a text file.
totrevni       Converts an Avro data file to a Trevni file.
trevni_meta    Dumps a Trevni file's metadata as JSON.
trevni_random  Create a Trevni file filled with random instances of a schema.
trevni_tojson  Dumps a Trevni file as JSON.

To convert the JSON sample data to Avro binary format, use the "fromjson" tool; to get JSON back from an Avro file, use the "tojson" tool.

Command for serializing JSON
Without any compression
java -jar avro-tools-1.7.5.jar fromjson --schema-file StudentActivity.avsc StudentActivity.json > StudentActivity.avro

With snappy compression
java -jar avro-tools-1.7.5.jar fromjson --codec snappy --schema-file StudentActivity.avsc StudentActivity.json > StudentActivity.snappy.avro

Command for deserializing to JSON
The same command deserializes both compressed and uncompressed data, because the codec is recorded in the file itself:
java -jar avro-tools-1.7.5.jar tojson StudentActivity.avro
java -jar avro-tools-1.7.5.jar tojson StudentActivity.snappy.avro
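A quick way to confirm the round trip is to compare the "tojson" output with the original JSON. The Python sketch below is an illustration only and assumes the "tojson" output has been captured as text; the helper names parse_lines and same_records are made up here. It compares parsed records rather than raw text, so differences in key order or spacing do not matter.

```python
import json

def parse_lines(text):
    """Parse newline-delimited JSON into a list of records."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]

def same_records(original_text, roundtrip_text):
    """True when both texts hold equal records in the same order."""
    return parse_lines(original_text) == parse_lines(roundtrip_text)

# A shortened record standing in for one line of StudentActivity.json.
original = ('{"id":"x","student_id":101,'
            '"course_details":{"verb":"started","result_score":0.65}}')

# The same record with different key order and spacing.
roundtrip = ('{"student_id": 101, "id": "x", '
             '"course_details": {"result_score": 0.65, "verb": "started"}}')

print(same_records(original, roundtrip))  # prints: True
```

With real files, original_text would be the contents of StudentActivity.json and roundtrip_text the captured output of the "tojson" command.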

Since an Avro data file also contains the schema, we can retrieve it using this command:
java -jar avro-tools-1.7.5.jar getschema StudentActivity.avro
java -jar avro-tools-1.7.5.jar getschema StudentActivity.snappy.avro
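
To see where the schema lives, here is a rough Python sketch that reads the header of an Avro data file by hand. It follows the container file layout in the Avro specification as we understand it: the magic bytes "Obj" plus the byte 1, then a metadata map from string keys to byte values (with keys such as avro.schema and avro.codec), then a 16-byte sync marker. This is not how avro-tools is implemented, and the function names are invented for this example. A tiny hand-built header stands in for a real file so the sketch runs on its own.

```python
import io

MAGIC = b"Obj\x01"

def read_long(stream):
    """Decode one Avro long: a zig-zag value stored as a base-128 varint."""
    raw, shift = 0, 0
    while True:
        byte = stream.read(1)
        if not byte:
            raise EOFError("truncated varint")
        raw |= (byte[0] & 0x7F) << shift
        if not byte[0] & 0x80:      # high bit clear marks the last byte
            break
        shift += 7
    return (raw >> 1) ^ -(raw & 1)  # undo zig-zag encoding

def read_bytes(stream):
    """Read a length-prefixed byte string."""
    return stream.read(read_long(stream))

def read_header_metadata(stream):
    """Return the metadata map from the header of an Avro data file."""
    if stream.read(4) != MAGIC:
        raise ValueError("not an Avro data file")
    meta = {}
    while True:
        count = read_long(stream)
        if count == 0:              # a zero count ends the map
            break
        if count < 0:               # negative count is followed by a byte size
            read_long(stream)
            count = -count
        for _ in range(count):
            key = read_bytes(stream).decode("utf-8")
            meta[key] = read_bytes(stream)
    return meta

# Hand-built header: one map entry ("avro.codec" -> "null"), end of map, sync.
header = (MAGIC
          + bytes([2])                    # zig-zag for 1 entry
          + bytes([20]) + b"avro.codec"   # zig-zag for key length 10
          + bytes([8]) + b"null"          # zig-zag for value length 4
          + bytes([0])                    # end of map
          + bytes(16))                    # placeholder sync marker

print(read_header_metadata(io.BytesIO(header)))  # prints: {'avro.codec': b'null'}
```

Against a real file, opening StudentActivity.avro in binary mode and passing it to read_header_metadata should return a map whose avro.schema entry holds the schema JSON. If avro.codec is absent, the specification treats the codec as "null".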
In our next post we will use the Avro Java API for serialization/deserialization.
