DDL Generation for Datalake Hive tables

Purpose:

The objective of this post is to share my experience using Apache MetaModel and the Velocity template engine.

  1. http://metamodel.apache.org/
  2. http://velocity.apache.org/

Data Engineer's Requirement:

While establishing datalake zones such as the Landing Data Zone and the Raw Data Zone, one of the most time-consuming tasks for a developer is generating the DDLs that map source RDBMS tables to Hive tables in the datalake.

My observations in various datalake engagements:

In my past engagements I found developers writing their own common interfaces or abstract components such as “DBConnectionFactory” or “DBSchemaFactory”. One common problem I have seen with this approach is a lot of debate over how to map SQL types to JDBC types.

When development of these common interface/factory components was assigned to “common sense” developers, they referred to the respective RDBMS documentation for the SQL-to-JDBC type mapping. In other teams, developers made their own decisions about how to map SQL types to JDBC types.
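To make the type-mapping debate concrete, here is a minimal sketch of the kind of lookup such a component ends up containing, going one step further from JDBC type codes to Hive column types. It uses only the JDK's java.sql.Types constants. The class name HiveTypeMapper and the specific choices (for example, falling back to STRING for anything unrecognised) are my assumptions for illustration, not the mapping used in any of the engagements described here.

```java
import java.sql.Types;

/** Hypothetical helper: maps a JDBC type code to a Hive column type. */
public final class HiveTypeMapper {

    private HiveTypeMapper() {
    }

    /**
     * @param jdbcType  a constant from java.sql.Types
     * @param precision total number of digits (used for NUMERIC/DECIMAL only)
     * @param scale     digits after the decimal point (NUMERIC/DECIMAL only)
     */
    public static String toHiveType(int jdbcType, int precision, int scale) {
        switch (jdbcType) {
            case Types.BIT:
            case Types.BOOLEAN:
                return "BOOLEAN";
            case Types.TINYINT:
                return "TINYINT";
            case Types.SMALLINT:
                return "SMALLINT";
            case Types.INTEGER:
                return "INT";
            case Types.BIGINT:
                return "BIGINT";
            case Types.REAL:
                return "FLOAT";
            case Types.FLOAT:   // JDBC FLOAT is double precision
            case Types.DOUBLE:
                return "DOUBLE";
            case Types.NUMERIC:
            case Types.DECIMAL:
                // Hive DECIMAL supports at most 38 digits; outside that
                // range this sketch falls back to STRING (an assumption).
                if (precision < 1 || precision > 38 || scale < 0 || scale > precision) {
                    return "STRING";
                }
                return "DECIMAL(" + precision + "," + scale + ")";
            case Types.DATE:
                return "DATE";
            case Types.TIMESTAMP:
                return "TIMESTAMP";
            case Types.BINARY:
            case Types.VARBINARY:
            case Types.LONGVARBINARY:
            case Types.BLOB:
                return "BINARY";
            default:
                // CHAR, VARCHAR, CLOB and anything unrecognised land here.
                return "STRING";
        }
    }
}
```

Every branch in a table like this is a decision somebody has to make and defend per source database (how a given vendor reports BIT or NUMBER, for instance), which is exactly where the debates came from.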

My experience with Apache Metamodel and Velocity Template Engine:

While working on this kind of task (i.e. DDL generation for datalake Hive tables) I found Apache MetaModel very helpful: it provides a common interface across RDBMS stores such as Oracle, MySQL and SQL Server.

In one of my past engagements we were able to map RDBMS tables to Hive tables and generate the Hive DDLs very quickly (there were more than 500 tables across the source databases). We added Apache MetaModel as a Maven dependency in our project and developed a wrapper named MainDDLGenerator to instantiate the MetaModel API classes.

For the DDL generation itself, I used the Velocity template engine along with MetaModel to externalize the Hive DDL generation logic. The template engine was needed to add TBLPROPERTIES to Hive tables that apply only in the PROD environment and not in other environments such as development, QA or integration.

Though we can write IF-ELSE conditions in the Velocity template to meet the above requirement, I felt that if the MainDDLGenerator component were designed to accept any Velocity template as one of its inputs, besides the source and destination database configs, we could make MainDDLGenerator extensible to various scenarios.
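The design idea in the last two paragraphs can be sketched in plain Java. This is a stand-in, not the original MainDDLGenerator and not Velocity itself: the DdlTemplate interface plays the role of a loaded template file, and the generator simply renders whichever template it is handed. All names here (DdlGeneratorSketch, DdlTemplate, basic, prodAware), the STORED AS ORC clause and the 'orc.compress' property are placeholders I chose for illustration.

```java
import java.util.Map;

/** Hypothetical sketch of a DDL generator that accepts the template as an input. */
public final class DdlGeneratorSketch {

    private DdlGeneratorSketch() {
    }

    /** Stand-in for a template: turns a table description into DDL text. */
    public interface DdlTemplate {
        String render(String table, Map<String, String> columns, String environment);
    }

    /** Template 1: the same DDL for every environment. */
    public static String basic(String table, Map<String, String> columns, String environment) {
        StringBuilder ddl = new StringBuilder();
        ddl.append("CREATE EXTERNAL TABLE IF NOT EXISTS ").append(table).append(" (\n");
        int i = 0;
        for (Map.Entry<String, String> column : columns.entrySet()) {
            ddl.append("  `").append(column.getKey()).append("` ").append(column.getValue());
            // A comma after every column except the last one.
            ddl.append(++i < columns.size() ? ",\n" : "\n");
        }
        ddl.append(")\nSTORED AS ORC");
        return ddl.toString();
    }

    /** Template 2: adds TBLPROPERTIES only when the environment is PROD. */
    public static String prodAware(String table, Map<String, String> columns, String environment) {
        String ddl = basic(table, columns, environment);
        if ("PROD".equalsIgnoreCase(environment)) {
            // Placeholder property; the real PROD-only properties are not given in this post.
            ddl += "\nTBLPROPERTIES ('orc.compress'='SNAPPY')";
        }
        return ddl;
    }

    /** The generator does not know which template it runs; it only renders it. */
    public static String generate(DdlTemplate template, String table,
                                  Map<String, String> columns, String environment) {
        return template.render(table, columns, environment) + ";";
    }
}
```

Calling generate(DdlGeneratorSketch::prodAware, "raw.customers", columns, "PROD") with an insertion-ordered map of column names to Hive types yields the DDL with the extra TBLPROPERTIES line, while the same call with "DEV" omits it. Supporting a new scenario then means writing another template rather than changing the generator.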