While establishing datalake data zones such as the Landing Zone and the Raw Data Zone, one of the most time-consuming tasks for a developer is generating DDLs to map source RDBMS tables to Hive tables in the datalake.
My observations in various datalake engagements:
Before I proceed: in my past engagements I found developers writing their own common interfaces or abstract components, such as a “DBConnectionFactory” or “DBSchemaFactory”. One common problem I have seen with this approach is a lot of debate over mapping SQL types to JDBC types.
Whenever development of these common interface/factory components was assigned to “common sense” developers, they referred to the respective RDBMS documentation for the SQL-to-JDBC type mappings. In other teams, developers made their own ad-hoc decisions when mapping SQL types to JDBC types.
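To illustrate the kind of hand-rolled mapping these debates revolved around, here is a minimal, hypothetical sketch (the class name and the specific type choices are my own, not from any project mentioned above; the “debatable” entries are exactly where teams tended to disagree):

```java
import java.sql.Types;
import java.util.HashMap;
import java.util.Map;

// Hypothetical hand-rolled SQL-to-JDBC type mapper of the kind teams
// often wrote themselves before adopting a library like Apache MetaModel.
public class SqlToJdbcTypeMapper {

    private static final Map<String, Integer> MAPPING = new HashMap<>();
    static {
        MAPPING.put("VARCHAR2", Types.VARCHAR);   // Oracle
        MAPPING.put("NUMBER",   Types.NUMERIC);   // Oracle: NUMERIC vs DECIMAL was a common debate
        MAPPING.put("DATETIME", Types.TIMESTAMP); // SQL Server
        MAPPING.put("TEXT",     Types.LONGVARCHAR); // MySQL: also debatable (CLOB?)
    }

    // Returns the java.sql.Types constant this team chose for a given SQL type name.
    public static int toJdbcType(String sqlTypeName) {
        Integer jdbcType = MAPPING.get(sqlTypeName.toUpperCase());
        if (jdbcType == null) {
            throw new IllegalArgumentException("Unmapped SQL type: " + sqlTypeName);
        }
        return jdbcType;
    }

    public static void main(String[] args) {
        System.out.println("NUMBER -> JDBC type code " + toJdbcType("NUMBER"));
    }
}
```

Every team that writes such a table by hand makes slightly different choices, which is the root of the debates described above.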
My experience with Apache Metamodel and Velocity Template Engine:
While I was working on this kind of task (i.e., DDL generation for datalake Hive tables) I found Apache MetaModel, which provides a common interface across RDBMS stores such as Oracle, MySQL, and SQL Server, very helpful.
In one of my past engagements we were able to map RDBMS tables to Hive tables and generate Hive DDLs very quickly (there were more than 500 tables across the source databases). We added Apache MetaModel’s Maven dependency to our Maven project and developed a wrapper named MainDDLGenerator to instantiate the Apache MetaModel API classes.
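A minimal sketch of how such a wrapper might drive the MetaModel schema API (the JDBC URL and credentials are placeholders, and the Hive type mapping is deliberately naive; the actual MainDDLGenerator mentioned above was more elaborate):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import org.apache.metamodel.DataContext;
import org.apache.metamodel.DataContextFactory;
import org.apache.metamodel.schema.Column;
import org.apache.metamodel.schema.Table;

public class MainDDLGenerator {

    public static void main(String[] args) throws Exception {
        // Placeholder connection details; swap in the real source RDBMS.
        Connection conn = DriverManager.getConnection(
                "jdbc:oracle:thin:@//dbhost:1521/ORCL", "user", "password");

        // MetaModel exposes the same schema API regardless of the backing RDBMS,
        // which removes the need for a hand-rolled DBSchemaFactory.
        DataContext dataContext = DataContextFactory.createJdbcDataContext(conn);

        for (Table table : dataContext.getDefaultSchema().getTables()) {
            StringBuilder ddl = new StringBuilder("CREATE EXTERNAL TABLE ")
                    .append(table.getName()).append(" (\n");
            for (Column column : table.getColumns()) {
                ddl.append("  ").append(column.getName())
                   .append(' ').append(toHiveType(column)).append(",\n");
            }
            ddl.setLength(ddl.length() - 2); // drop trailing comma
            ddl.append("\n) STORED AS PARQUET;");
            System.out.println(ddl);
        }
    }

    // Naive MetaModel ColumnType -> Hive type mapping; refine per project needs.
    private static String toHiveType(Column column) {
        if (column.getType().isNumber()) return "DOUBLE";
        if (column.getType().isTimeBased()) return "TIMESTAMP";
        return "STRING";
    }
}
```

Because MetaModel normalizes column types behind its ColumnType abstraction, the SQL-to-JDBC mapping debates described earlier largely disappear.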
As far as DDL generation is concerned, I used the Velocity Template Engine along with MetaModel to externalize the Hive DDL generation logic. The need for a template engine arose from a requirement to add extra TBLPROPERTIES to Hive tables that apply only in the PROD environment and not in other environments such as development/QA/integration.
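An illustrative Velocity template for this requirement might look as follows (the variable names `$tableName`, `$columns`, `$location`, and `$env` are hypothetical; the TBLPROPERTIES shown are just an example of a PROD-only setting):

```
CREATE EXTERNAL TABLE ${tableName} (
#foreach($column in $columns)
  ${column.name} ${column.hiveType}#if($foreach.hasNext),#end
#end
)
STORED AS PARQUET
LOCATION '${location}'
#if($env == "PROD")
TBLPROPERTIES ("auto.purge" = "false")
#end
;
```

The `#if($env == "PROD")` block is what lets one template serve every environment: only PROD runs emit the extra TBLPROPERTIES.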
Though we can write IF-ELSE conditions in the Velocity template to meet the above requirement, I felt that if the MainDDLGenerator component were designed to accept any Velocity template as an input, besides the source and destination database configs, we could make MainDDLGenerator extensible to various scenarios.
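The extensibility idea can be sketched with a small renderer that takes the template path as a parameter, so that any environment- or scenario-specific template can be plugged in without touching generator code (the file name `hive-ddl.vm` and the context keys are placeholders; the Velocity calls are the standard engine API):

```java
import java.io.StringWriter;
import org.apache.velocity.Template;
import org.apache.velocity.VelocityContext;
import org.apache.velocity.app.VelocityEngine;

public class DdlTemplateRenderer {

    // templatePath is supplied by the caller, so swapping templates
    // requires no change to the generator itself.
    public static String render(String templatePath, VelocityContext context) {
        VelocityEngine engine = new VelocityEngine();
        engine.setProperty("resource.loader", "file");
        engine.setProperty("file.resource.loader.path", ".");
        engine.init();

        Template template = engine.getTemplate(templatePath);
        StringWriter writer = new StringWriter();
        template.merge(context, writer);
        return writer.toString();
    }

    public static void main(String[] args) {
        VelocityContext context = new VelocityContext();
        context.put("tableName", "customers");
        context.put("env", "PROD"); // toggles any PROD-only blocks in the template
        System.out.println(render("hive-ddl.vm", context));
    }
}
```

With this shape, the same generator run against the same source schema can produce PROD DDLs, QA DDLs, or entirely different target dialects purely by supplying a different template file.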
Since 2013 I have attended a number of Hadoop Summits and Strata conferences, and I noticed a trend where speakers initially used to highlight the importance of Hadoop and its technology stack.
When I last attended, in December 2014, I noticed a new trend: instead of talking mostly about the Hadoop core, speakers introduced the datalake and its architectural components, such as the Raw Data Zone, metadata management, search, and integration tools, to the community. I was very happy when Cloudera engineers released the Hadoop Application Architectures book, and I recommended it to many people.
Just as Maven standardized project structure for Java developers, I feel the term “datalake” set a standard in the community by introducing a common vocabulary, such as Landing Zone and Raw Data Zone, among data engineers.
Being in consulting and having implemented datalakes at many enterprises, one question I was always asked was: “What next… beyond the datalake?”
Though I did not attend any sessions last year (2015), based on input I received from my colleagues I feel it is time for Hadoop Summit/Strata speakers to introduce to the community what lies beyond the datalake.
I think “domain-specific” data zones are the next step beyond datalake data zones; by “domain” I mean Banking, Financial Services, Healthcare, Telecom, etc. Many enterprises have probably already answered the question “What next after the datalake?” and implemented solutions, but such solutions might be locked away in source code repositories (like SVN) in the form of conceptual/solution architecture artifacts. One might find many videos or blogs on the kinds of solutions big data teams have developed beyond the datalake, but I feel that consolidating such solutions and introducing a reference architecture and vocabulary to the community through these summits would greatly help data engineers.
I might be wrong in expressing these views, but I thought I would share my opinion on “beyond the datalake”.