Monday, 27 October 2014

Apache Avro - Data Serialization Framework

Avro is an Apache open source project that provides data serialization and data exchange services for Hadoop. Using Avro (which functions similarly to systems such as Apache Thrift and Google's Protocol Buffers), data can be exchanged between programs written in different languages. Avro is gaining users relative to other popular serialization frameworks because many Hadoop-based tools support Avro for serialization and deserialization.


Before we get into the features, let's understand serialization and deserialization.
Serialization means turning structured objects into a byte stream for transmission over the network or for writing to persistent storage.
Deserialization is the opposite of serialization: we read the byte stream, from the network or from persistent storage, and turn it back into structured objects.
The serialized data, which is in a binary format, is accompanied by its schema, allowing any application to deserialize the data.
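
The round trip described above can be sketched with nothing but the Python standard library. This is only an illustration of the idea of shipping a schema alongside binary data; the layout used here (a length-prefixed JSON header followed by packed fields) and the names SCHEMA, serialize and deserialize are assumptions made for this example, not Avro's actual format or API.

```python
import json
import struct

# Hypothetical schema: each field name is paired with a struct format code.
# This is NOT Avro's schema language, just a stand-in for the idea.
SCHEMA = {"fields": [["id", "i"], ["salary", "f"]]}


def serialize(record, schema):
    """Turn a dict into bytes: [4-byte schema length][schema JSON][packed fields]."""
    header = json.dumps(schema).encode("utf-8")
    fmt = "<" + "".join(code for _, code in schema["fields"])
    body = struct.pack(fmt, *(record[name] for name, _ in schema["fields"]))
    return struct.pack("<I", len(header)) + header + body


def deserialize(data):
    """Read the embedded schema first, then use it to decode the fields."""
    (header_len,) = struct.unpack_from("<I", data, 0)
    schema = json.loads(data[4:4 + header_len].decode("utf-8"))
    fmt = "<" + "".join(code for _, code in schema["fields"])
    values = struct.unpack_from(fmt, data, 4 + header_len)
    return dict(zip((name for name, _ in schema["fields"]), values))


payload = serialize({"id": 7, "salary": 1500.5}, SCHEMA)  # structured object -> bytes
restored = deserialize(payload)                           # bytes -> structured object
```

Because the reader recovers the schema from the payload itself, deserialize needs no prior knowledge of the record layout, which is the property the paragraph above attributes to Avro data.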

Some Features Of Avro
  • Avro serialized data doesn't require proxy objects or code generation (unless desired for statically typed languages). Avro uses schema definitions at runtime during data exchange. It always stores the data structure definition with the data, which makes the data easier to process without resorting to code generation.
  • Because the schema is present when data is read, less type information needs to be encoded with each value, resulting in a smaller serialized size.
  • Avro uses JSON to define a data structure schema. A simple schema example, emp.avsc:

{"namespace": "test123.avro",
 "type": "record",
 "name": "EmpName",
 "fields": [{"name": "Name", "type": "string"},
            {"name": "ID", "type": "int"},
            {"name": "Dept", "type": "string"},
            {"name": "Sal", "type": "float"}
           ]}
  • Avro lets you define Remote Procedure Call (RPC) protocols to send data. The Avro RPC interface is specified in JSON.
  • Avro APIs exist for languages such as Java, C, C++, C#, Python and Ruby.
  • Avro provides a compact binary data format suited to data-intensive applications.
  • Avro is fast and compact and can be used together with Hadoop MapReduce.
  • Avro handles schema changes such as missing fields, added fields and changed fields.
  • Avro supports a rich set of primitive data types including:
    • Null
    • Boolean
    • Int: 32-bit signed integer
    • Long: 64-bit signed integer
    • Float
    • Double
    • Bytes
    • String
It also supports complex types including arrays, maps, enumerations and records. 
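
To make the schema and the primitive types more concrete, here is a minimal sketch of how a record matching the emp.avsc fields could be turned into Avro-style binary. It follows the encoding rules in the Avro specification for these types (zig-zag variable-length integers for int and long, 4-byte little-endian float, length-prefixed UTF-8 for string, fields written in schema order), but it is a hand-rolled illustration, not the Avro library. The function names and the employee values are made up for this example, and it covers only the record body, not the container file format.

```python
import json
import struct

# The schema fields from emp.avsc, parsed as ordinary JSON.
SCHEMA = json.loads("""
{"namespace": "test123.avro",
 "type": "record",
 "name": "EmpName",
 "fields": [{"name": "Name", "type": "string"},
            {"name": "ID", "type": "int"},
            {"name": "Dept", "type": "string"},
            {"name": "Sal", "type": "float"}]}
""")


def encode_long(n):
    """Zig-zag the value, then write it as a base-128 varint (int and long)."""
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while z & ~0x7F:
        out.append((z & 0x7F) | 0x80)  # more bytes follow
        z >>= 7
    out.append(z)
    return bytes(out)


def decode_long(buf, pos):
    """Inverse of encode_long; returns (value, next position)."""
    shift, z = 0, 0
    while True:
        b = buf[pos]
        pos += 1
        z |= (b & 0x7F) << shift
        if not b & 0x80:
            break
        shift += 7
    return (z >> 1) ^ -(z & 1), pos


def encode_record(record, schema):
    """Concatenate the encoded fields in schema order; no field names are written."""
    out = bytearray()
    for field in schema["fields"]:
        value, ftype = record[field["name"]], field["type"]
        if ftype in ("int", "long"):
            out += encode_long(value)
        elif ftype == "float":
            out += struct.pack("<f", value)
        elif ftype == "string":
            raw = value.encode("utf-8")
            out += encode_long(len(raw)) + raw
        else:
            raise ValueError("type not handled in this sketch: %s" % ftype)
    return bytes(out)


def decode_record(buf, schema):
    """Walk the schema to know which type to read next from the bytes."""
    pos, record = 0, {}
    for field in schema["fields"]:
        ftype = field["type"]
        if ftype in ("int", "long"):
            value, pos = decode_long(buf, pos)
        elif ftype == "float":
            (value,) = struct.unpack_from("<f", buf, pos)
            pos += 4
        elif ftype == "string":
            size, pos = decode_long(buf, pos)
            value = buf[pos:pos + size].decode("utf-8")
            pos += size
        else:
            raise ValueError("type not handled in this sketch: %s" % ftype)
        record[field["name"]] = value
    return record


# Made-up employee used only for illustration.
emp = {"Name": "Asha", "ID": 7, "Dept": "IT", "Sal": 1500.5}
encoded = encode_record(emp, SCHEMA)
decoded = decode_record(encoded, SCHEMA)
```

Since field names and types live in the schema rather than in the bytes, the encoded record stays small, which is the size benefit mentioned in the feature list. In practice an Avro library for your language would handle this, along with the complex types.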
