Monday 17 February 2014

Introduction to Avro

Apache Avro is a popular data serialization system, and it is gaining users as many Hadoop-based tools natively support Avro for serialization and deserialization.
In this post we will cover the basics of Avro.

What is Avro?
Data serialization system
Uses JSON-based schemas
Uses RPC calls to send data
Schemas are sent during data exchange

Avro provides:
Rich data structures.
A compact, fast, binary data format.
A container file, to store persistent data.
Simple integration with many languages. Code generation is not required to read or write data files, nor to use or implement RPC protocols; it is an optional optimization, only worth implementing for statically typed languages.
Avro APIs exist for Java, C, C++, C#, Python, and Ruby.

Avro Schema:
Avro relies on schemas for serialization/deserialization.
When Avro data is stored in a file, its schema is stored with it, so that files may be processed later by any program.
If the program reading the data expects a different schema, this can be easily resolved, since both schemas are present.
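For instance, a container file can be opened and its embedded schema inspected without supplying any schema up front. Below is a minimal sketch using the Avro Java API; the file name users.avro is made up for this example.

import java.io.File;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

public class ReadEmbeddedSchema {
    public static void main(String[] args) throws Exception {
        // Open the container file; no schema is supplied by the caller.
        DataFileReader<GenericRecord> reader = new DataFileReader<>(
                new File("users.avro"), new GenericDatumReader<GenericRecord>());
        System.out.println(reader.getSchema()); // the writer's schema embedded in the file
        for (GenericRecord record : reader) {   // records are decoded using that schema
            System.out.println(record);
        }
        reader.close();
    }
}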
Avro supports a wide range of datatypes which are listed below:
Primitive Types
  • null: no value
  • boolean: a binary value
  • int: 32-bit signed integer
  • long: 64-bit signed integer
  • float: single precision (32-bit) IEEE 754 floating-point number
  • double: double precision (64-bit) IEEE 754 floating-point number
  • bytes: sequence of 8-bit unsigned bytes
  • string: unicode character sequence
Complex Types
Avro supports six kinds of complex types: records, enums, arrays, maps, unions, and fixed. Detailed information on these complex types is available in the Avro specification.
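To make this concrete, here is a small example of what a record schema might look like, combining primitive types with a union and an array (the record and field names are invented for illustration):

{
  "type": "record",
  "name": "Employee",
  "namespace": "com.example",
  "fields": [
    {"name": "id",     "type": "long"},
    {"name": "name",   "type": "string"},
    {"name": "email",  "type": ["null", "string"], "default": null},
    {"name": "skills", "type": {"type": "array", "items": "string"}}
  ]
}

Here email is a union of null and string with a null default, which is the usual way to declare an optional field.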

Schema Resolution:
A reader of Avro data, whether from an RPC or a file, can always parse that data because its schema is provided. But that schema may not be exactly the schema that was expected. For example, if the data was written with a different version of the software than it is read, then records may have had fields added or removed.

We call the schema used to write the data the writer's schema, and the schema that the application expects the reader's schema. Differences between these should be resolved as follows (a short sketch after the list illustrates the record rules):
  • It is an error if the two schemas do not match.
    To match, one of the following must hold:
    • both schemas are arrays whose item types match
    • both schemas are maps whose value types match
    • both schemas are enums whose names match
    • both schemas are fixed whose sizes and names match
    • both schemas are records with the same name
    • either schema is a union
    • both schemas have the same primitive type
    • the writer's schema may be promoted to the reader's as follows:
      • int is promotable to long, float, or double
      • long is promotable to float or double
      • float is promotable to double
  • if both are records:
    • the ordering of fields may be different: fields are matched by name.
    • schemas for fields with the same name in both records are resolved recursively.
    • if the writer's record contains a field with a name not present in the reader's record, the writer's value for that field is ignored.
    • if the reader's record schema has a field that contains a default value, and writer's schema does not have a field with the same name, then the reader should use the default value from its field.
    • if the reader's record schema has a field with no default value, and writer's schema does not have a field with the same name, an error is signalled.
  • if both are enums: if the writer's symbol is not present in the reader's enum, then an error is signalled.
  • if both are arrays: This resolution algorithm is applied recursively to the reader's and writer's array item schemas.
  • if both are maps: This resolution algorithm is applied recursively to the reader's and writer's value schemas.
  • if both are unions: The first schema in the reader's union that matches the selected writer's union schema is recursively resolved against it. If none match, an error is signalled.
  • if reader's is a union, but writer's is not: The first schema in the reader's union that matches the writer's schema is recursively resolved against it. If none match, an error is signalled.
  • if writer's is a union, but reader's is not: If the reader's schema matches the selected writer's schema, it is recursively resolved against it. If they do not match, an error is signalled.
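As a minimal sketch of how these rules play out with the Avro Java API, consider a hypothetical User record written with one schema and read back with another in which id has been promoted from int to long and an email field with a default value has been added (the schemas and names are invented for illustration):

import java.io.ByteArrayOutputStream;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.Decoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class SchemaResolutionSketch {
    public static void main(String[] args) throws Exception {
        // Writer's schema: the schema the data is serialized with.
        Schema writerSchema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
          + "{\"name\":\"id\",\"type\":\"int\"},"
          + "{\"name\":\"name\",\"type\":\"string\"}]}");

        // Reader's schema: id promoted to long, plus a new email field with a default.
        Schema readerSchema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
          + "{\"name\":\"id\",\"type\":\"long\"},"
          + "{\"name\":\"name\",\"type\":\"string\"},"
          + "{\"name\":\"email\",\"type\":\"string\",\"default\":\"unknown\"}]}");

        // Serialize one record using the writer's schema.
        GenericRecord user = new GenericData.Record(writerSchema);
        user.put("id", 1);
        user.put("name", "alice");
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(writerSchema).write(user, encoder);
        encoder.flush();

        // Deserialize, giving Avro both schemas so it can resolve the differences.
        Decoder decoder = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
        GenericDatumReader<GenericRecord> reader =
            new GenericDatumReader<>(writerSchema, readerSchema);
        GenericRecord resolved = reader.read(null, decoder);
        // id is read back as a long and email is filled with the default "unknown".
        System.out.println(resolved);
    }
}

If the email field in the reader's schema had no default value, the read would fail with an error, as described above.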
In the next post we will see a program to serialize and deserialize some data using Avro, and also see how Avro handles schema evolution.
