Putting Avro into Hive

S. Sreekanth, A Sai Ram Pramodhini, Ch S Likita, Chiluka Manisha

Abstract


Avro is an Apache™ open source project that provides data serialization and data exchange services for Hadoop®. These services can be used together or independently. Using Avro, big data can be exchanged between programs written in any language. Using the serialization service, programs can efficiently serialize data into files or into messages. The data storage is compact and efficient. Avro stores both the data definition and the data together in one message or file, making it easy for programs to dynamically understand the information stored in an Avro file or message. Avro stores the data definition in JSON format, making it easy to read and interpret; the data itself is stored in binary format, making it compact and efficient. Avro files include markers that can be used to split large data sets into subsets suitable for MapReduce processing. Some data exchange services use a code generator to interpret the data definition and produce code to access the data. Avro doesn't require this step, making it ideal for scripting languages.

Overview – Working with Avro from Hive

The AvroSerde allows users to read or write Avro data as Hive tables. The AvroSerde:

- Infers the schema of the Hive table from the Avro schema. Starting in Hive 0.14, the Avro schema can be inferred from the Hive table schema.
- Reads all Avro files within a table against a specified schema, taking advantage of Avro's backward-compatibility abilities.
- Supports arbitrarily nested schemas.
- Translates all Avro data types into equivalent Hive types. Most types map exactly, but some Avro types don't exist in Hive and are automatically converted by the AvroSerde.
- Understands compressed Avro files.
- Transparently converts the Avro idiom of handling nullable types as Union[T, null] into just T and returns null when appropriate.
- Writes any Hive table to Avro files.
- Has worked reliably against our most convoluted Avro schemas in our ETL process.
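The schema inference and nullable-union handling described above can be illustrated with a small Python sketch. It is a simplified model for illustration only, not the AvroSerde's actual implementation: the record and field names are hypothetical, only primitive types and two-branch nullable unions are handled, and the primitive mapping (for example, Avro long to Hive bigint and Avro bytes to Hive binary) follows the Hive AvroSerDe documentation.

```python
import json

# An Avro schema is plain JSON. The record and field names below are
# hypothetical and exist only for this illustration.
AVRO_SCHEMA_JSON = """
{
  "type": "record",
  "name": "Episode",
  "namespace": "example.avro",
  "fields": [
    {"name": "title", "type": "string"},
    {"name": "air_date", "type": ["null", "string"], "default": null},
    {"name": "viewers", "type": ["null", "long"], "default": null},
    {"name": "rating", "type": "double"}
  ]
}
"""

# Avro primitive type -> Hive type (primitives only; complex types such as
# record, array, and map are deliberately left out of this sketch).
PRIMITIVE_TO_HIVE = {
    "boolean": "boolean",
    "int": "int",
    "long": "bigint",
    "float": "float",
    "double": "double",
    "bytes": "binary",
    "string": "string",
}


def hive_type(avro_type):
    """Return the Hive type for a primitive or nullable-primitive Avro type."""
    if isinstance(avro_type, list):
        # Avro expresses "nullable T" as a union of null and T.
        # Unwrap it to plain T, since Hive columns are nullable anyway.
        non_null = [t for t in avro_type if t != "null"]
        if len(avro_type) == 2 and len(non_null) == 1:
            return hive_type(non_null[0])
        raise ValueError("only two-branch nullable unions are handled here")
    if isinstance(avro_type, str) and avro_type in PRIMITIVE_TO_HIVE:
        return PRIMITIVE_TO_HIVE[avro_type]
    raise ValueError(f"type not covered by this sketch: {avro_type!r}")


def hive_columns(schema_json):
    """Derive (column name, Hive type) pairs from an Avro record schema."""
    schema = json.loads(schema_json)
    return [(field["name"], hive_type(field["type"])) for field in schema["fields"]]


columns = hive_columns(AVRO_SCHEMA_JSON)
for name, type_name in columns:
    print(f"{name} {type_name}")
```

The two nullable fields come out as ordinary string and bigint columns, which mirrors the Union[T, null] to T conversion listed above.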
Starting in Hive 0.14, columns can be added to an Avro-backed Hive table using the ALTER TABLE statement.
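As a rough sketch of what such a statement looks like, the Python helper below assembles a HiveQL ALTER TABLE ... ADD COLUMNS string. The helper name, the table name, and the column names are all hypothetical, and the code only builds the text of the statement; running it against Hive is left to whatever client is in use.

```python
import re

# Conservative identifier check for this sketch: letters, digits, underscore.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def add_columns_statement(table, columns):
    """Build a HiveQL ALTER TABLE ... ADD COLUMNS statement.

    `columns` is a list of (column name, Hive type) pairs.
    """
    if not columns:
        raise ValueError("at least one column is required")
    for identifier in [table] + [name for name, _ in columns]:
        if not _IDENTIFIER.fullmatch(identifier):
            raise ValueError(f"unsupported identifier: {identifier!r}")
    column_list = ", ".join(f"{name} {type_name}" for name, type_name in columns)
    return f"ALTER TABLE {table} ADD COLUMNS ({column_list})"


# Hypothetical table and columns, used only to show the shape of the output.
statement = add_columns_statement(
    "episodes", [("network", "string"), ("season", "int")]
)
print(statement)
```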






Copyright (c) 2017 Edupedia Publications Pvt Ltd

Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

 

All published Articles are Open Access at  https://journals.pen2print.org/index.php/ijr/ 

