We will look at three ways of using Avro for serialization/deserialization:
- Using Avro command line tools.
- Using Avro Java API without code generation.
- Using Avro Java API with code generation.
Sample Data
We will use the sample data below (StudentActivity.json):
{"id":"A91D021BA58444B29D4D42CA5E39F7BF","student_id":100,"university_id":908,"course_details":{"course_id":100,"enroll_date":"2012-02-13 00:00:00.000000000","verb":"completed","result_score":0.9}}
{"id":"502A77CC99B241CB94CA356F5218F1A9","student_id":101,"university_id":112,"course_details":{"course_id":233,"enroll_date":"2011-06-08 00:00:00.000000000","verb":"started","result_score":0.65}}
{"id":"5D04CD5ABF014D6EBA237766F9B470DE","student_id":102,"university_id":340,"course_details":{"course_id":339,"enroll_date":"2012-03-06 00:00:00.000000000","verb":"started","result_score":0.57}}
Defining a schema
Avro schemas are defined using JSON. The Avro schema for our sample data is defined below (StudentActivity.avsc):
{
  "namespace": "com.rishav.avro",
  "type": "record",
  "name": "StudentActivity",
  "fields": [
    {
      "name": "id",
      "type": "string"
    },
    {
      "name": "student_id",
      "type": "int"
    },
    {
      "name": "university_id",
      "type": "int"
    },
    {
      "name": "course_details",
      "type": {
        "name": "Activity",
        "type": "record",
        "fields": [
          {
            "name": "course_id",
            "type": "int"
          },
          {
            "name": "enroll_date",
            "type": "string"
          },
          {
            "name": "verb",
            "type": "string"
          },
          {
            "name": "result_score",
            "type": "double"
          }
        ]
      }
    }
  ]
}
1. Serialization/Deserialization using Avro command line tools
Avro provides a jar file named avro-tools-<version>.jar, which bundles many command line tools. Running it without arguments lists them:
$ java -jar avro-tools-1.7.5.jar
Version 1.7.5 of Apache Avro
Copyright 2010 The Apache Software Foundation
This product includes software developed at
The Apache Software Foundation (http://www.apache.org/).
C JSON parsing provided by Jansson and
written by Petri Lehtinen. The original software is
available from http://www.digip.org/jansson/.
----------------
Available tools:
cat            extracts samples from files
compile        Generates Java code for the given schema.
concat         Concatenates avro files without re-compressing.
fragtojson     Renders a binary-encoded Avro datum as JSON.
fromjson       Reads JSON records and writes an Avro data file.
fromtext       Imports a text file into an avro data file.
getmeta        Prints out the metadata of an Avro data file.
getschema      Prints out schema of an Avro data file.
idl            Generates a JSON schema from an Avro IDL file
idl2schemata   Extract JSON schemata of the types from an Avro IDL file
induce         Induce schema/protocol from Java class/interface via reflection.
jsontofrag     Renders a JSON-encoded Avro datum as binary.
random         Creates a file with randomly generated instances of a schema.
recodec        Alters the codec of a data file.
rpcprotocol    Output the protocol of a RPC service
rpcreceive     Opens an RPC Server and listens for one message.
rpcsend        Sends a single RPC message.
tether         Run a tethered mapreduce job.
tojson         Dumps an Avro data file as JSON, one record per line.
totext         Converts an Avro data file to a text file.
totrevni       Converts an Avro data file to a Trevni file.
trevni_meta    Dumps a Trevni file's metadata as JSON.
trevni_random  Create a Trevni file filled with random instances of a schema.
trevni_tojson  Dumps a Trevni file as JSON.
To convert the JSON sample data to the Avro binary format, use the "fromjson" tool; to get JSON data back from Avro files, use the "tojson" tool.
Command for serializing JSON
Without any compression
java -jar avro-tools-1.7.5.jar fromjson --schema-file StudentActivity.avsc StudentActivity.json > StudentActivity.avro
With snappy compression
java -jar avro-tools-1.7.5.jar fromjson --codec snappy --schema-file StudentActivity.avsc StudentActivity.json > StudentActivity.snappy.avro
Command for deserializing to JSON
The same command is used for deserializing both compressed and uncompressed data, because the codec is recorded in the data file's header:
java -jar avro-tools-1.7.5.jar tojson StudentActivity.avro
java -jar avro-tools-1.7.5.jar tojson StudentActivity.snappy.avro
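The commands above can be chained into a quick round-trip check. The script below is only a sketch under a few assumptions: the jar, StudentActivity.avsc and StudentActivity.json are in the current directory, the AVRO_TOOLS variable and the roundtrip helper are names made up for this illustration, and the ".roundtrip.json" output files are arbitrary. It writes one plain and one snappy-compressed Avro file, decodes both, and compares the decoded records with each other.

```shell
#!/bin/sh
# Round-trip sketch: JSON -> Avro (uncompressed and snappy) -> JSON.
# AVRO_TOOLS is a hypothetical variable; point it at your avro-tools jar.
AVRO_TOOLS="${AVRO_TOOLS:-avro-tools-1.7.5.jar}"

# roundtrip STEM [extra fromjson options...]
# Writes STEM.avro from StudentActivity.json, then decodes it
# back into STEM.roundtrip.json.
roundtrip() {
  stem="$1"; shift
  java -jar "$AVRO_TOOLS" fromjson "$@" --schema-file StudentActivity.avsc \
      StudentActivity.json > "$stem.avro" &&
  java -jar "$AVRO_TOOLS" tojson "$stem.avro" > "$stem.roundtrip.json"
}

if [ -f "$AVRO_TOOLS" ] && [ -f StudentActivity.avsc ] && [ -f StudentActivity.json ]; then
  # Both files should decode to the same records regardless of codec.
  roundtrip StudentActivity &&
  roundtrip StudentActivity.snappy --codec snappy &&
  if cmp -s StudentActivity.roundtrip.json StudentActivity.snappy.roundtrip.json; then
    STATUS="match"
  else
    STATUS="differ"
  fi
else
  # Prerequisites missing: do nothing rather than fail noisily.
  STATUS="skipped"
fi
echo "round trip: ${STATUS:-failed}"
```

If the two decoded files match, compression changed only how the records are stored, not the records themselves. To see which codec a file was written with, running the getmeta tool on it should list an avro.codec entry among the metadata.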
As an Avro data file also contains the schema, we can retrieve it using this command:
java -jar avro-tools-1.7.5.jar getschema StudentActivity.avro
java -jar avro-tools-1.7.5.jar getschema StudentActivity.snappy.avro
In our next post we will use the Avro Java API for serialization/deserialization.