Data-driven value creation is a key success factor for every company. Value can be derived from data in many ways: by increasing sales, by decreasing cost, by decreasing risk or, most exciting and also most complex, by identifying and creating new lines of business.
The tool to create value from data is often called the “Information Factory”, a complex system that processes data to extract information and, ultimately, value. One important task of the Information Factory is the transportation and transformation of data.
At a technical level, many different formats are available to handle data. XML documents and comma-separated value (CSV) files, as exported from Excel, have been around for some time. In the context of Hadoop, data lakes and, more recently, data lakehouses, data formats optimised for speed and interoperability have been introduced. One of these fast and efficient file formats is Avro.
Avro is a self-describing, row-based storage format that combines the data definition, i.e. the schema, and the data itself in one file. Many programming languages such as Java or Python provide Avro support, and many data systems such as Google BigQuery can import data from Avro files. BigQuery is a database as a service designed for large data volumes and high query speed.
In this tutorial, you learn how to create Avro files using the Java programming language. To create the Avro schema, we make use of the schema information provided by a database. Similar approaches can be used for XML or JSON files as source data. First, a sample database containing just one table is created using H2. The content of this table is then stored in an Avro file.
Without further ado, let’s start and create the database:
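A minimal sketch of this setup step could look as follows. The table and column names (ADDRESS with ID, NAME and CREATED_AT) and the two sample rows are taken from the verification output further down; the column types, the class name SampleDatabase and the JDBC URL are assumptions made for illustration. The H2 driver is assumed to be on the classpath when a connection is opened.

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.List;

/** Sketch: sets up a one-table sample database (names and types are assumptions). */
final class SampleDatabase {

    // Assumed URL of an H2 in-memory database. DB_CLOSE_DELAY=-1 keeps the
    // database alive until the JVM exits, even if no connection is open.
    static final String URL = "jdbc:h2:mem:sample;DB_CLOSE_DELAY=-1";

    /** DDL and sample rows for the single table used in this tutorial. */
    static List<String> setupStatements() {
        return List.of(
            "CREATE TABLE ADDRESS (ID INT PRIMARY KEY, NAME VARCHAR(100), CREATED_AT TIMESTAMP)",
            "INSERT INTO ADDRESS (ID, NAME, CREATED_AT) "
                + "VALUES (10, 'Test', TIMESTAMP '2021-10-16 00:00:00')",
            "INSERT INTO ADDRESS (ID, NAME, CREATED_AT) "
                + "VALUES (20, 'Test 2', TIMESTAMP '2021-10-16 00:00:00')");
    }

    /** Executes all setup statements on the given connection. */
    static void create(Connection connection) throws SQLException {
        try (Statement statement = connection.createStatement()) {
            for (String sql : setupStatements()) {
                statement.execute(sql);
            }
        }
    }
}
```

A connection for this sketch would typically be obtained with DriverManager.getConnection(SampleDatabase.URL) and passed to SampleDatabase.create.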
The database is now available and the Avro file writing can start. JDBC, the Java database API, provides metadata such as column names and column types for every query result. Let’s make use of this:
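As an illustration, the metadata of a query result can be collected with the standard java.sql.ResultSetMetaData interface. The class ColumnReader and its nested Column type are hypothetical helpers, not part of JDBC or Avro.

```java
import java.sql.ResultSetMetaData;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

/** Sketch: collects column names and types from JDBC result metadata. */
final class ColumnReader {

    /** Name and java.sql.Types code of one result column (hypothetical helper type). */
    static final class Column {
        final String name;
        final int sqlType;

        Column(String name, int sqlType) {
            this.name = name;
            this.sqlType = sqlType;
        }
    }

    static List<Column> read(ResultSetMetaData metaData) throws SQLException {
        List<Column> columns = new ArrayList<>();
        // JDBC column indexes are 1-based.
        for (int i = 1; i <= metaData.getColumnCount(); i++) {
            columns.add(new Column(metaData.getColumnLabel(i), metaData.getColumnType(i)));
        }
        return columns;
    }
}
```

The metadata object itself comes from ResultSet.getMetaData(), for example on the result of a query such as SELECT * FROM ADDRESS.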
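Now everything is prepared and the Avro schema can be created. Data type mapping can be applied in both of the next two steps. In overview: first the schema is derived from the query metadata, then the rows are mapped and appended to the Avro file.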
Let’s look into the schema creation first:
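One way to sketch the schema creation without any dependency is to generate the schema in its JSON form, since Avro schemas are defined in JSON. The type mapping below is an assumption chosen for illustration: integer, floating-point and boolean columns map to the matching Avro primitives, and everything else, including timestamps, is stored as a string. Every field is declared as a union with null so that SQL NULL values can be written. The class SchemaSketch is a hypothetical helper; with the Avro library, the resulting JSON can be turned into a Schema object using Schema.Parser.

```java
import java.sql.Types;
import java.util.Map;
import java.util.StringJoiner;

/** Sketch: derives an Avro record schema (as JSON) from JDBC column types. */
final class SchemaSketch {

    /** Maps a java.sql.Types code to an Avro primitive type name. */
    static String avroType(int sqlType) {
        switch (sqlType) {
            case Types.TINYINT:
            case Types.SMALLINT:
            case Types.INTEGER:
                return "int";
            case Types.BIGINT:
                return "long";
            case Types.REAL:
                return "float";
            case Types.FLOAT:
            case Types.DOUBLE:
                return "double";
            case Types.BIT:
            case Types.BOOLEAN:
                return "boolean";
            default:
                // Text, decimals, dates and timestamps are stored as strings here.
                return "string";
        }
    }

    /**
     * Builds the schema JSON. Column names are assumed to be valid Avro names
     * (letters, digits and underscores), so no JSON escaping is applied.
     */
    static String recordSchema(String recordName, Map<String, Integer> columns) {
        StringJoiner fields = new StringJoiner(",");
        for (Map.Entry<String, Integer> column : columns.entrySet()) {
            fields.add("{\"name\":\"" + column.getKey() + "\",\"type\":[\"null\",\""
                    + avroType(column.getValue()) + "\"],\"default\":null}");
        }
        return "{\"type\":\"record\",\"name\":\"" + recordName
                + "\",\"fields\":[" + fields + "]}";
    }
}
```

The map passed to recordSchema should preserve the column order of the query, for example a LinkedHashMap filled from the result metadata.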
This is a very basic mapping — you would have to adapt it to your needs.
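The final step is appending the actual data to the Avro file. The Avro library provides the necessary API: a DataFileWriter, wrapping a GenericDatumWriter, is created for the schema and the target file, each database row is put into a GenericRecord, and the record is appended to the writer.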
A simple mapping would look like this:
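As a sketch of such a simple mapping, the following code copies every column value of the current row unchanged, keyed by column label. The class SimpleRowMapping is hypothetical, and the Map stands in for an Avro GenericRecord, whose put(String, Object) method takes the same arguments. Passing values through unchanged only works when the JDBC value already matches the type declared in the schema.

```java
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.sql.SQLException;
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch: copies the current row of a result set, one entry per column. */
final class SimpleRowMapping {

    static Map<String, Object> mapRow(ResultSet resultSet) throws SQLException {
        ResultSetMetaData metaData = resultSet.getMetaData();
        Map<String, Object> row = new LinkedHashMap<>();
        // JDBC column indexes are 1-based; values are passed through as-is.
        for (int i = 1; i <= metaData.getColumnCount(); i++) {
            row.put(metaData.getColumnLabel(i), resultSet.getObject(i));
        }
        return row;
    }
}
```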
In practice, the mapping would be more detailed and handle null values, e.g. as follows:
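A more detailed mapping could be sketched as below. It assumes a schema in which integer columns are declared as int, BIGINT as long, REAL as float, FLOAT and DOUBLE as double, booleans as boolean and everything else as string, each in a union with null. The class DetailedMapping is hypothetical; the timestamp pattern is an assumption chosen to match the CREATED_AT values shown in the verification output.

```java
import java.sql.Timestamp;
import java.sql.Types;
import java.text.SimpleDateFormat;

/** Sketch: converts one JDBC value into a value matching the assumed Avro type. */
final class DetailedMapping {

    // Produces values such as 2021-10-16T00:00:00.000+0200 (offset depends on the time zone).
    static final String TIMESTAMP_PATTERN = "yyyy-MM-dd'T'HH:mm:ss.SSSZ";

    static Object toAvroValue(Object jdbcValue, int sqlType) {
        if (jdbcValue == null) {
            // SQL NULL is written as the null branch of the field's union type.
            return null;
        }
        switch (sqlType) {
            case Types.TINYINT:
            case Types.SMALLINT:
            case Types.INTEGER:
                return ((Number) jdbcValue).intValue();
            case Types.BIGINT:
                return ((Number) jdbcValue).longValue();
            case Types.REAL:
                return ((Number) jdbcValue).floatValue();
            case Types.FLOAT:
            case Types.DOUBLE:
                return ((Number) jdbcValue).doubleValue();
            case Types.BIT:
            case Types.BOOLEAN:
                return (Boolean) jdbcValue;
            case Types.TIMESTAMP:
                // SimpleDateFormat is not thread-safe, so a new instance is used per call.
                return new SimpleDateFormat(TIMESTAMP_PATTERN).format((Timestamp) jdbcValue);
            default:
                // Text, decimals and all other types are stored in their string form.
                return jdbcValue.toString();
        }
    }
}
```

The java.sql.Types code for each column is available from ResultSetMetaData.getColumnType, so the function can be called once per column while iterating over a row.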
The code above copies the content of the sample in-memory database to a sample Avro file.
For verification, the result can be read back, e.g. in Python:
>>> import pandas
>>> import fastavro
>>> def avro_df(filepath, encoding):
...     with open(filepath, encoding) as fp:
...         reader = fastavro.reader(fp)
...         records = [r for r in reader]
...         df = pandas.DataFrame.from_records(records)
...         return df
...
>>> df = avro_df("address.avro", "rb")
>>> df
   ID    NAME                    CREATED_AT
0  10    Test  2021-10-16T00:00:00.000+0200
1  20  Test 2  2021-10-16T00:00:00.000+0200
For the code to work, two dependencies need to be resolved: Avro and, for compression, Snappy. For example, with Maven these are the artifacts org.apache.avro:avro and org.xerial.snappy:snappy-java.
With this, we have successfully created an Avro file: we extracted data from an H2 database, created a new schema, mapped the data and wrote the Avro file. The individual steps can be adapted to your needs, for example by changing the data source to XML files or by mapping more data types such as date and time. This approach gives you a starting point for writing Avro files with Java and for making use of the benefits Avro files offer.