Thursday, 27 February 2014

Convert csv data to Avro data

In one of my previous posts I explained how to convert JSON data to Avro data and vice versa using the avro-tools command line options. Today I was looking at the options we have for converting CSV data to Avro format; as of now there is no avro-tools option to accomplish this. So we can either write our own Java program (a MapReduce job or a simple standalone program) or we can use the AvroSerDe available with Hive to do this quickly and without writing any code :)
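
For reference, a rough sketch of the plain-Java route is shown below; it uses the Avro generic API with the same three-column student_marks schema used later in this post (the .avsc file name and paths are illustrative). The rest of this post follows the Hive route.

package com.rishav.avro;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DatumWriter;

public class CsvToAvro {
  public static void main(String[] args) throws IOException {
    // student_marks schema: student_id, subject_id, marks (all int); file name is illustrative
    Schema schema = new Schema.Parser().parse(new File("student_marks.avsc"));

    DatumWriter<GenericRecord> datumWriter = new GenericDatumWriter<GenericRecord>(schema);
    DataFileWriter<GenericRecord> dataFileWriter = new DataFileWriter<GenericRecord>(datumWriter);
    dataFileWriter.create(schema, new File("student.avro"));

    BufferedReader csv = new BufferedReader(new FileReader("student.csv"));
    String line;
    while ((line = csv.readLine()) != null) {
      // each CSV line is student_id,subject_id,marks
      String[] cols = line.split(",");
      GenericRecord rec = new GenericData.Record(schema);
      rec.put("student_id", Integer.parseInt(cols[0]));
      rec.put("subject_id", Integer.parseInt(cols[1]));
      rec.put("marks", Integer.parseInt(cols[2]));
      dataFileWriter.append(rec);
    }
    csv.close();
    dataFileWriter.close();
  }
}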

To convert CSV data to Avro data using Hive we need to follow the steps below:
  1. Create a Hive table stored as textfile, specifying the CSV delimiter.
  2. Load the CSV file into this table using the "load data" command.
  3. Create another Hive table using AvroSerDe.
  4. Insert data from the first table into the new Avro Hive table using the "insert overwrite" command.

To demonstrate this I will use the data below (student.csv):
0,38,91
0,65,28
0,78,16
1,34,96
1,78,14
1,11,43
Now execute the queries below in Hive:
--1. Create a Hive table stored as textfile
USE test;
CREATE TABLE csv_table (
student_id INT,
subject_id INT,
marks INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
--2. Load csv_table with student.csv data
LOAD DATA LOCAL INPATH "/path/to/student.csv" OVERWRITE INTO TABLE test.csv_table;
--3. Create another Hive table using AvroSerDe
CREATE TABLE avro_table
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES (
'avro.schema.literal'='{
"namespace": "com.rishav.avro",
"name": "student_marks",
"type": "record",
"fields": [ { "name":"student_id","type":"int"}, { "name":"subject_id","type":"int"}, { "name":"marks","type":"int"}]
}');
--4. Load avro_table with data from csv_table
INSERT OVERWRITE TABLE avro_table SELECT student_id, subject_id, marks FROM csv_table;

Now you can get the data in Avro format from the Hive warehouse folder. To dump it to the local file system use the command below:
hadoop fs -cat /path/to/warehouse/test.db/avro_table/* > student.avro

If you want to get JSON data back from this Avro file you can use the avro-tools command:
java -jar avro-tools-1.7.5.jar tojson student.avro > student.json

So we can easily convert CSV to Avro (and on to JSON as well) by writing just four HQL statements.

Tuesday, 25 February 2014

Implementing Custom WritableComparable

In one of my previous posts I wrote about implementing a custom Writable, which can be used for values in a MapReduce program. To use a custom key in MapReduce we need to implement the WritableComparable interface.

The WritableComparable interface is just a subinterface of the Writable and java.lang.Comparable interfaces. To implement a WritableComparable we must provide a compareTo method in addition to the readFields and write methods, as shown below:
public interface WritableComparable<T> extends Writable, Comparable<T>
{
    void readFields(DataInput in) throws IOException;
    void write(DataOutput out) throws IOException;
    int compareTo(T o);
}
Comparison of types is crucial for MapReduce, where there is a sorting phase during which keys are compared with one another.

The code for the IntPair class, which is used in the In-mapper Combiner Program to Calculate Average post, is given below:

package com.hadoop.imcdp;

import java.io.*;
import java.util.Set;
import java.lang.Integer;

import org.apache.hadoop.io.*;

public class IntPair implements WritableComparable<IntPair> {
  private IntWritable first;
  private IntWritable second;

  public IntPair() {
    set(new IntWritable(), new IntWritable());
  }

  public IntPair(Integer first, Integer second) {
    set(new IntWritable(first), new IntWritable(second));
  }

  public void set(IntWritable first, IntWritable second) {
    this.first = first;
    this.second = second;
  }

  public IntWritable getFirst() {
    return first;
  }

  public Integer getFirstInt() {
    return new Integer(first.toString());
  }

  public Integer getSecondInt() {
    return new Integer(second.toString());
  }

  public IntWritable getSecond() {
    return second;
  }

  @Override
  public void write(DataOutput out) throws IOException {
    first.write(out);
    second.write(out);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    first.readFields(in);
    second.readFields(in);
  }

  @Override
  public int hashCode() {
    return first.hashCode() * 163 + second.hashCode();
  }

  @Override
  public boolean equals(Object o) {
    if (o instanceof IntPair) {
      IntPair tp = (IntPair) o;
      return first.equals(tp.first) && second.equals(tp.second);
    }
    return false;
  }

  @Override
  public int compareTo(IntPair tp) {
    int cmp = first.compareTo(tp.first);
    if (cmp != 0) {
      return cmp;
    }
    return second.compareTo(tp.second);
  }
}

As you can see in compareTo(IntPair tp) above, the IntPair needs to be deserialized before the comparison can take place. To avoid this we can implement a RawComparator, which compares two keys by inspecting their serialized representations directly. More on RawComparator is available in Hadoop: The Definitive Guide.
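
A rough sketch of such a raw comparator for IntPair is given below (this is not part of the original post's code). It assumes the serialized layout produced by IntPair.write() above: two 4-byte ints written back to back.

package com.hadoop.imcdp;

import org.apache.hadoop.io.WritableComparator;

// Raw comparator for IntPair: compares keys directly on their serialized bytes.
public class IntPairRawComparator extends WritableComparator {

  public IntPairRawComparator() {
    super(IntPair.class);
  }

  @Override
  public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
    // compare the first ints straight from the byte arrays
    int cmp = compareInts(readInt(b1, s1), readInt(b2, s2));
    if (cmp != 0) {
      return cmp;
    }
    // fall back to the second ints, which start 4 bytes into each key
    return compareInts(readInt(b1, s1 + 4), readInt(b2, s2 + 4));
  }

  private static int compareInts(int a, int b) {
    return (a < b) ? -1 : ((a == b) ? 0 : 1);
  }
}

To have MapReduce pick it up, either register it in a static initializer of IntPair with WritableComparator.define(IntPair.class, new IntPairRawComparator()), or set it on the job with job.setSortComparatorClass(IntPairRawComparator.class).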

Wednesday, 19 February 2014

Avro Schema Evolution

Avro can use different schemas for serialization and deserialization, and it can handle removed, added and modified fields. Thus it helps in building decoupled and robust systems.

In this post we will serialize data using this schema:
{
  "namespace": "com.rishav.avro",
  "type": "record",
  "name": "StudentActivity",
  "fields": [
    {"name": "id", "type": "string"},
    {"name": "student_id", "type": "int"},
    {"name": "university_id", "type": "int"},
    {"name": "course_details", "type": {
      "name": "Activity",
      "type": "record",
      "fields": [
        {"name": "course_id", "type": "int"},
        {"name": "enroll_date", "type": "string"},
        {"name": "verb", "type": "string"},
        {"name": "result_score", "type": "double"}
      ]
    }}
  ]
}

and deserialize it using a different schema:
{
  "namespace": "com.rishav.avro",
  "type": "record",
  "name": "StudentActivity",
  "fields": [
    {"name": "id", "type": "string"},
    {"name": "student_id", "type": "int"},
    {"name": "age", "type": ["null", "int"], "default": null},
    {"name": "course_details", "type": {
      "name": "Activity",
      "type": "record",
      "fields": [
        {"name": "course_id", "type": "int"},
        {"name": "enroll_date", "type": "string"},
        {"name": "verb", "type": "string"},
        {"name": "score", "type": "double", "aliases": ["result_score"]}
      ]
    }}
  ]
}
which has the following modifications:
  1. university_id field is removed.
  2. age field is added.
  3. result_score field is renamed to score.
Before we actually see how Avro handles these modifications I would like to mention the points below:
  • If a new field is added then it must have a default value. Also specify its type as a union starting with null, e.g. "type": ["null", "string"], otherwise you will get this error:
    Exception in thread "main" java.lang.NoSuchMethodError: org.codehaus.jackson.node.ValueNode.asText()Ljava/lang/String;
  • If a field is renamed then the old name must be listed in its aliases.

In the Java program below we serialize data using the StudentActivity.avsc schema and deserialize it using the StudentActivityNew.avsc schema:
package com.rishav.avro;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Iterator;
import java.util.LinkedHashMap;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.DatumReader;
import org.apache.avro.io.DatumWriter;
import org.codehaus.jackson.JsonFactory;
import org.codehaus.jackson.JsonParseException;
import org.codehaus.jackson.JsonProcessingException;
import org.codehaus.jackson.map.ObjectMapper;
import org.json.simple.JSONObject;

public class AvroSchemaEvolution {

  public void serialize() throws JsonParseException, JsonProcessingException, IOException {
    InputStream in = new FileInputStream("resources/StudentActivity.json");
    // create a schema
    Schema schema = new Schema.Parser().parse(new File("resources/StudentActivity.avsc"));
    // create a record to hold json
    GenericRecord AvroRec = new GenericData.Record(schema);
    // create a record to hold course_details
    GenericRecord CourseRec = new GenericData.Record(schema.getField("course_details").schema());
    // this file will have Avro output data
    File AvroFile = new File("resources/StudentActivity.avro");
    // create a writer to serialize the record
    DatumWriter<GenericRecord> datumWriter = new GenericDatumWriter<GenericRecord>(schema);
    DataFileWriter<GenericRecord> dataFileWriter = new DataFileWriter<GenericRecord>(datumWriter);
    dataFileWriter.create(schema, AvroFile);
    // iterate over JSONs present in input file and write to Avro output file
    for (Iterator it = new ObjectMapper().readValues(
        new JsonFactory().createJsonParser(in), JSONObject.class); it.hasNext();) {
      JSONObject JsonRec = (JSONObject) it.next();
      AvroRec.put("id", JsonRec.get("id"));
      AvroRec.put("student_id", JsonRec.get("student_id"));
      AvroRec.put("university_id", JsonRec.get("university_id"));
      LinkedHashMap CourseDetails = (LinkedHashMap) JsonRec.get("course_details");
      CourseRec.put("course_id", CourseDetails.get("course_id"));
      CourseRec.put("enroll_date", CourseDetails.get("enroll_date"));
      CourseRec.put("verb", CourseDetails.get("verb"));
      CourseRec.put("result_score", CourseDetails.get("result_score"));
      AvroRec.put("course_details", CourseRec);
      dataFileWriter.append(AvroRec);
    } // end of for loop
    in.close();
    dataFileWriter.close();
  } // end of serialize method

  public void deserialize() throws IOException {
    // create a schema
    Schema schema = new Schema.Parser().parse(new File("resources/StudentActivityNew.avsc"));
    // Schema schema = new Schema.Parser().parse(new File("resources/StudentActivity.avsc"));
    // create a record using schema
    GenericRecord AvroRec = new GenericData.Record(schema);
    File AvroFile = new File("resources/StudentActivity.avro");
    DatumReader<GenericRecord> datumReader = new GenericDatumReader<GenericRecord>(schema);
    DataFileReader<GenericRecord> dataFileReader = new DataFileReader<GenericRecord>(AvroFile, datumReader);
    System.out.println("Deserialized data is :");
    while (dataFileReader.hasNext()) {
      AvroRec = dataFileReader.next(AvroRec);
      System.out.println(AvroRec);
    }
  }

  public static void main(String[] args) throws JsonParseException, JsonProcessingException, IOException {
    AvroSchemaEvolution AvroEx = new AvroSchemaEvolution();
    AvroEx.serialize();
    AvroEx.deserialize();
  }
}

On executing this code we see that Avro handles the modifications without any issues and our data is deserialized properly.

Tuesday, 18 February 2014

Getting started with Avro Part2

In the previous post we used avro-tools commands to serialize and deserialize data. In this post we will use the Avro Java API to achieve the same. We will use the same sample data and schema as in the previous post.

The Java code for serializing and deserializing data without generating code for the schema is given below:

package com.rishav.avro;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Iterator;
import java.util.LinkedHashMap;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.DatumReader;
import org.apache.avro.io.DatumWriter;
import org.codehaus.jackson.JsonFactory;
import org.codehaus.jackson.JsonParseException;
import org.codehaus.jackson.JsonProcessingException;
import org.codehaus.jackson.map.ObjectMapper;
import org.json.simple.JSONObject;

public class AvroExampleWithoutCodeGeneration {

  public void serialize() throws JsonParseException, JsonProcessingException, IOException {
    InputStream in = new FileInputStream("resources/StudentActivity.json");
    // create a schema
    Schema schema = new Schema.Parser().parse(new File("resources/StudentActivity.avsc"));
    // create a record to hold json
    GenericRecord AvroRec = new GenericData.Record(schema);
    // create a record to hold course_details
    GenericRecord CourseRec = new GenericData.Record(schema.getField("course_details").schema());
    // this file will have Avro output data
    File AvroFile = new File("resources/StudentActivity.avro");
    // create a writer to serialize the record
    DatumWriter<GenericRecord> datumWriter = new GenericDatumWriter<GenericRecord>(schema);
    DataFileWriter<GenericRecord> dataFileWriter = new DataFileWriter<GenericRecord>(datumWriter);
    dataFileWriter.create(schema, AvroFile);
    // iterate over JSONs present in input file and write to Avro output file
    for (Iterator it = new ObjectMapper().readValues(
        new JsonFactory().createJsonParser(in), JSONObject.class); it.hasNext();) {
      JSONObject JsonRec = (JSONObject) it.next();
      AvroRec.put("id", JsonRec.get("id"));
      AvroRec.put("student_id", JsonRec.get("student_id"));
      AvroRec.put("university_id", JsonRec.get("university_id"));
      LinkedHashMap CourseDetails = (LinkedHashMap) JsonRec.get("course_details");
      CourseRec.put("course_id", CourseDetails.get("course_id"));
      CourseRec.put("enroll_date", CourseDetails.get("enroll_date"));
      CourseRec.put("verb", CourseDetails.get("verb"));
      CourseRec.put("result_score", CourseDetails.get("result_score"));
      AvroRec.put("course_details", CourseRec);
      dataFileWriter.append(AvroRec);
    } // end of for loop
    in.close();
    dataFileWriter.close();
  } // end of serialize method

  public void deserialize() throws IOException {
    // create a schema
    Schema schema = new Schema.Parser().parse(new File("resources/StudentActivity.avsc"));
    // create a record using schema
    GenericRecord AvroRec = new GenericData.Record(schema);
    File AvroFile = new File("resources/StudentActivity.avro");
    DatumReader<GenericRecord> datumReader = new GenericDatumReader<GenericRecord>(schema);
    DataFileReader<GenericRecord> dataFileReader = new DataFileReader<GenericRecord>(AvroFile, datumReader);
    System.out.println("Deserialized data is :");
    while (dataFileReader.hasNext()) {
      AvroRec = dataFileReader.next(AvroRec);
      System.out.println(AvroRec);
    }
  }

  public static void main(String[] args) throws JsonParseException, JsonProcessingException, IOException {
    AvroExampleWithoutCodeGeneration AvroEx = new AvroExampleWithoutCodeGeneration();
    AvroEx.serialize();
    AvroEx.deserialize();
  }
}

To generate Java code for the schema from the Avro JSON schema we can use the avro-tools jar. The command for this is given below:
java -jar avro-tools-1.7.5.jar compile schema StudentActivity.avsc <output_path>
The output path can be the source folder of the project, or we can add the generated Java class files to the Eclipse project manually.
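
The generated classes also come with builders. As a small example (it assumes the StudentActivity and Activity classes generated from StudentActivity.avsc, and uses values from the sample data), a record could be built like this instead of calling the individual setters:

// Assumes StudentActivity and Activity were generated from StudentActivity.avsc by avro-tools.
StudentActivity sa = StudentActivity.newBuilder()
    .setId("A91D021BA58444B29D4D42CA5E39F7BF")
    .setStudentId(100)
    .setUniversityId(908)
    .setCourseDetails(Activity.newBuilder()
        .setCourseId(100)
        .setEnrollDate("2012-02-13 00:00:00.000000000")
        .setVerb("completed")
        .setResultScore(0.9)
        .build())
    .build();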

The Java code for serializing and deserializing data with code generation for the schema is similar to the code above, except that in the previous code we were assigning values to a GenericRecord and here we assign values to the generated Avro objects:

package com.rishav.avro;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Iterator;
import java.util.LinkedHashMap;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DatumReader;
import org.apache.avro.io.DatumWriter;
import org.codehaus.jackson.JsonFactory;
import org.codehaus.jackson.JsonParseException;
import org.codehaus.jackson.JsonProcessingException;
import org.codehaus.jackson.map.ObjectMapper;
import org.json.simple.JSONObject;

public class AvroExampleWithCodeGeneration {

  public void serialize() throws JsonParseException, JsonProcessingException, IOException {
    InputStream in = new FileInputStream("resources/StudentActivity.json");
    // create a schema
    Schema schema = new Schema.Parser().parse(new File("resources/StudentActivity.avsc"));
    // create an object to hold json record
    StudentActivity sa = new StudentActivity();
    // create an object to hold course_details
    Activity a = new Activity();
    // this file will have Avro output data
    File AvroFile = new File("resources/StudentActivity.avro");
    // create a writer to serialize the record
    DatumWriter<GenericRecord> datumWriter = new GenericDatumWriter<GenericRecord>(schema);
    DataFileWriter<GenericRecord> dataFileWriter = new DataFileWriter<GenericRecord>(datumWriter);
    dataFileWriter.create(schema, AvroFile);
    // iterate over JSONs present in input file and write to Avro output file
    for (Iterator it = new ObjectMapper().readValues(
        new JsonFactory().createJsonParser(in), JSONObject.class); it.hasNext();) {
      JSONObject JsonRec = (JSONObject) it.next();
      sa.setId((CharSequence) JsonRec.get("id"));
      sa.setStudentId((Integer) JsonRec.get("student_id"));
      sa.setUniversityId((Integer) JsonRec.get("university_id"));
      LinkedHashMap CourseDetails = (LinkedHashMap) JsonRec.get("course_details");
      a.setCourseId((Integer) CourseDetails.get("course_id"));
      a.setEnrollDate((CharSequence) CourseDetails.get("enroll_date"));
      a.setVerb((CharSequence) CourseDetails.get("verb"));
      a.setResultScore((Double) CourseDetails.get("result_score"));
      sa.setCourseDetails(a);
      dataFileWriter.append(sa);
    } // end of for loop
    in.close();
    dataFileWriter.close();
  } // end of serialize method

  public void deserialize() throws IOException {
    // create a schema
    Schema schema = new Schema.Parser().parse(new File("resources/StudentActivity.avsc"));
    // create a record using schema
    GenericRecord AvroRec = new GenericData.Record(schema);
    File AvroFile = new File("resources/StudentActivity.avro");
    DatumReader<GenericRecord> datumReader = new GenericDatumReader<GenericRecord>(schema);
    DataFileReader<GenericRecord> dataFileReader = new DataFileReader<GenericRecord>(AvroFile, datumReader);
    System.out.println("Deserialized data is :");
    while (dataFileReader.hasNext()) {
      AvroRec = dataFileReader.next(AvroRec);
      System.out.println(AvroRec);
    }
  }

  public static void main(String[] args) throws JsonParseException, JsonProcessingException, IOException {
    AvroExampleWithCodeGeneration AvroEx = new AvroExampleWithCodeGeneration();
    AvroEx.serialize();
    AvroEx.deserialize();
  }
}

In the next post we will see how Avro deals with schema evolution.

Getting started with Avro Part1

In our previous post we got a basic idea about Avro; in this post we will use Avro for serializing and deserializing data.

We will use these three methods for Avro serialization/deserialization:
  1. Using Avro command line tools.
  2. Using Avro Java API without code generation.
  3. Using Avro Java API with code generation.

Sample Data

We will use the sample data below (StudentActivity.json):
{"id":"A91D021BA58444B29D4D42CA5E39F7BF","student_id":100,"university_id":908,"course_details":{"course_id":100,"enroll_date":"2012-02-13 00:00:00.000000000","verb":"completed","result_score":0.9}}
{"id":"502A77CC99B241CB94CA356F5218F1A9","student_id":101,"university_id":112,"course_details":{"course_id":233,"enroll_date":"2011-06-08 00:00:00.000000000","verb":"started","result_score":0.65}}
{"id":"5D04CD5ABF014D6EBA237766F9B470DE","student_id":102,"university_id":340,"course_details":{"course_id":339,"enroll_date":"2012-03-06 00:00:00.000000000","verb":"started","result_score":0.57}}
Note that the JSON records are nested ones.

Defining a schema

Avro schemas are defined using JSON. The Avro schema for our sample data is defined below (StudentActivity.avsc):
{
  "namespace": "com.rishav.avro",
  "type": "record",
  "name": "StudentActivity",
  "fields": [
    {"name": "id", "type": "string"},
    {"name": "student_id", "type": "int"},
    {"name": "university_id", "type": "int"},
    {"name": "course_details", "type": {
      "name": "Activity",
      "type": "record",
      "fields": [
        {"name": "course_id", "type": "int"},
        {"name": "enroll_date", "type": "string"},
        {"name": "verb", "type": "string"},
        {"name": "result_score", "type": "double"}
      ]
    }}
  ]
}


1. Serialization/Deserialization using Avro command line tools

Avro ships a jar file named avro-tools-<version>.jar which provides many command line tools, as listed below:

$ java -jar avro-tools-1.7.5.jar
Version 1.7.5 of Apache Avro
Copyright 2010 The Apache Software Foundation
This product includes software developed at
The Apache Software Foundation (http://www.apache.org/).
C JSON parsing provided by Jansson and
written by Petri Lehtinen. The original software is
available from http://www.digip.org/jansson/.
----------------
Available tools:
  cat           extracts samples from files
  compile       Generates Java code for the given schema.
  concat        Concatenates avro files without re-compressing.
  fragtojson    Renders a binary-encoded Avro datum as JSON.
  fromjson      Reads JSON records and writes an Avro data file.
  fromtext      Imports a text file into an avro data file.
  getmeta       Prints out the metadata of an Avro data file.
  getschema     Prints out schema of an Avro data file.
  idl           Generates a JSON schema from an Avro IDL file
  idl2schemata  Extract JSON schemata of the types from an Avro IDL file
  induce        Induce schema/protocol from Java class/interface via reflection.
  jsontofrag    Renders a JSON-encoded Avro datum as binary.
  random        Creates a file with randomly generated instances of a schema.
  recodec       Alters the codec of a data file.
  rpcprotocol   Output the protocol of a RPC service
  rpcreceive    Opens an RPC Server and listens for one message.
  rpcsend       Sends a single RPC message.
  tether        Run a tethered mapreduce job.
  tojson        Dumps an Avro data file as JSON, one record per line.
  totext        Converts an Avro data file to a text file.
  totrevni      Converts an Avro data file to a Trevni file.
  trevni_meta   Dumps a Trevni file's metadata as JSON.
  trevni_random Create a Trevni file filled with random instances of a schema.
  trevni_tojson Dumps a Trevni file as JSON.

For converting the JSON sample data to Avro binary format use the "fromjson" option, and for getting JSON data back from Avro files use the "tojson" option.

Commands for serializing JSON
Without any compression:
java -jar avro-tools-1.7.5.jar fromjson --schema-file StudentActivity.avsc StudentActivity.json > StudentActivity.avro

With snappy compression:
java -jar avro-tools-1.7.5.jar fromjson --codec snappy --schema-file StudentActivity.avsc StudentActivity.json > StudentActivity.snappy.avro

Command for deserializing to JSON
The same command is used for deserializing both compressed and uncompressed data:
java -jar avro-tools-1.7.5.jar tojson StudentActivity.avro
java -jar avro-tools-1.7.5.jar tojson StudentActivity.snappy.avro

As an Avro data file also contains the schema, we can retrieve it using this command:
java -jar avro-tools-1.7.5.jar getschema StudentActivity.avro
java -jar avro-tools-1.7.5.jar getschema StudentActivity.snappy.avro
In our next post we will use the Avro Java API for serialization/deserialization.

Monday, 17 February 2014

Introduction to Avro

Apache Avro is a popular data serialization format and is gaining more users as many Hadoop-based tools natively support Avro for serialization and deserialization.
In this post we will go over some Avro basics.

What is Avro?
  • A data serialization system
  • Uses JSON-based schemas
  • Uses RPC calls to send data
  • Schemas are sent during data exchange

Avro provides:
  • Rich data structures.
  • A compact, fast, binary data format.
  • A container file, to store persistent data.
  • Simple integration with many languages. Code generation is not required to read or write data files, nor to use or implement RPC protocols; it is an optional optimization, only worth implementing for statically typed languages.
Avro APIs exist for these languages: Java, C, C++, C#, Python and Ruby.

Avro Schema:
Avro relies on schemas for serialization/deserialization.
When Avro data is stored in a file, its schema is stored with it, so that files may be processed later by any program.
If the program reading the data expects a different schema this can be easily resolved, since both schemas are present.
Avro supports a wide range of data types, which are listed below:
Primitive Types
  • null: no value
  • boolean: a binary value
  • int: 32-bit signed integer
  • long: 64-bit signed integer
  • float: single precision (32-bit) IEEE 754 floating-point number
  • double: double precision (64-bit) IEEE 754 floating-point number
  • bytes: sequence of 8-bit unsigned bytes
  • string: unicode character sequence
Complex Types
Avro supports six kinds of complex types: records, enums, arrays, maps, unions and fixed. Detailed information on these complex types is available in the Avro specification.

Schema Resolution:
A reader of Avro data, whether from an RPC or a file, can always parse that data because its schema is provided. But that schema may not be exactly the schema that was expected. For example, if the data was written with a different version of the software than it is read, then records may have had fields added or removed.

We call the schema used to write the data the writer's schema, and the schema that the application expects the reader's schema. Differences between the two are resolved as follows (a short code sketch follows the list):
  • It is an error if the two schemas do not match.
    To match, one of the following must hold:
    • both schemas are arrays whose item types match
    • both schemas are maps whose value types match
    • both schemas are enums whose names match
    • both schemas are fixed whose sizes and names match
    • both schemas are records with the same name
    • either schema is a union
    • both schemas have same primitive type
    • the writer's schema may be promoted to the reader's as follows:
      • int is promotable to long, float, or double
      • long is promotable to float or double
      • float is promotable to double
  • if both are records:
    • the ordering of fields may be different: fields are matched by name.
    • schemas for fields with the same name in both records are resolved recursively.
    • if the writer's record contains a field with a name not present in the reader's record, the writer's value for that field is ignored.
    • if the reader's record schema has a field that contains a default value, and writer's schema does not have a field with the same name, then the reader should use the default value from its field.
    • if the reader's record schema has a field with no default value, and writer's schema does not have a field with the same name, an error is signalled.
  • if both are enums: if the writer's symbol is not present in the reader's enum, then an error is signalled.
  • if both are arrays: This resolution algorithm is applied recursively to the reader's and writer's array item schemas.
  • if both are maps: This resolution algorithm is applied recursively to the reader's and writer's value schemas.
  • if both are unions: The first schema in the reader's union that matches the selected writer's union schema is recursively resolved against it. If none match, an error is signalled.
  • if the reader's is a union, but the writer's is not: the first schema in the reader's union that matches the writer's schema is recursively resolved against it. If none match, an error is signalled.
  • if the writer's is a union, but the reader's is not: if the reader's schema matches the selected writer's schema, it is recursively resolved against it. If they do not match, an error is signalled.
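
To make the writer's/reader's schema distinction concrete, here is a minimal Java sketch (file names are illustrative, not from this post) that reads an Avro data file while supplying the expected reader's schema; the writer's schema is taken from the file itself and resolved against it:

package com.rishav.avro;

import java.io.File;
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DatumReader;

public class SchemaResolutionSketch {
  public static void main(String[] args) throws IOException {
    // reader's schema: what this application expects (illustrative file name)
    Schema readerSchema = new Schema.Parser().parse(new File("reader.avsc"));
    // the writer's schema is stored in the Avro data file and is picked up automatically
    DatumReader<GenericRecord> datumReader = new GenericDatumReader<GenericRecord>(readerSchema);
    DataFileReader<GenericRecord> fileReader =
        new DataFileReader<GenericRecord>(new File("data.avro"), datumReader);
    while (fileReader.hasNext()) {
      // each record is resolved into the shape of the reader's schema
      System.out.println(fileReader.next());
    }
    fileReader.close();
  }
}
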
In the next post we will see a program to serialize/deserialize some data using Avro and also see how Avro handles schema evolution.