DDL Generation for Datalake Hive tables

Purpose:

The objective of this post is to share my past experience using Apache MetaModel and the Velocity template engine.

  1. http://metamodel.apache.org/
  2. http://velocity.apache.org/

Data Engineers' Requirement:

While establishing datalake data zones such as the Landing Data Zone and the Raw Data Zone, one of the more time-consuming tasks for a developer is generating the DDLs that map source RDBMS tables to Hive tables in the datalake.

My observations in various datalake engagements:

Before I proceed: in my past engagements I found developers writing their own common interfaces or abstract components such as "DBConnectionFactory" / "DBSchemaFactory". One common problem I have seen with this approach is a lot of debate over how to map SQL types to JDBC types.

Whenever the development of these common interface/factory components was assigned to sensible developers, they referred to the respective RDBMS documentation for the SQL-to-JDBC type mapping; in other teams, developers made their own ad hoc decisions about the mapping.
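To illustrate the kind of code these debates produced, here is a hypothetical hand-rolled mapper built directly on java.sql.Types (illustrative only, not taken from any particular engagement); every team's version differed slightly, which is exactly the problem:

import java.sql.Types;

// Hypothetical hand-rolled SQL-to-Hive type mapper of the kind each team re-invented.
public class NaiveTypeMapper {
    public static String toHiveType(int jdbcType) {
        switch (jdbcType) {
            case Types.SMALLINT:
            case Types.INTEGER:   return "INT";
            case Types.BIGINT:    return "BIGINT";
            case Types.NUMERIC:
            case Types.DECIMAL:   return "DECIMAL(38,10)"; // precision/scale chosen differently by every team
            case Types.CHAR:
            case Types.VARCHAR:   return "STRING";
            case Types.DATE:      return "DATE";
            case Types.TIMESTAMP: return "TIMESTAMP";
            default:              return "STRING";         // silent fallback, another source of inconsistent DDLs
        }
    }
}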

My experience with Apache MetaModel and the Velocity Template Engine:

While I was working on this kind of task (i.e. DDL generation for datalake Hive tables), I found Apache MetaModel, which provides a common interface across RDBMS stores such as Oracle, MySQL, and SQL Server, to be very helpful.

In one of my past engagements we were able to map RDBMS tables to Hive tables and generate the Hive DDLs very quickly (there were more than 500 tables across the source databases). We added Apache MetaModel's Maven dependency to our Maven project and developed a wrapper named MainDDLGenerator that instantiates the MetaModel API classes.

For the DDL generation itself I used the Velocity template engine along with MetaModel to externalize the Hive DDL generation logic. The need for a template engine arose because certain additional TBLPROPERTIES had to be applied to the Hive tables only in the PROD environment and not in other environments such as development, QA, or integration.

Though we could have written IF-ELSE conditions in the Velocity template to meet this requirement, I felt that if the MainDDLGenerator component were designed to accept any Velocity template as one of its inputs, alongside the source and destination database configs, it would be extensible to many other scenarios. A minimal sketch of this design follows.
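Below is a minimal sketch of such a generator, assuming MetaModel's JDBC module and Velocity are on the classpath; the class name matches the wrapper mentioned above, but the argument handling and template names are illustrative rather than the exact code from that engagement. MetaModel reads the source schema through its common interface, and whichever Velocity template is passed in renders the Hive DDL; the PROD-only TBLPROPERTIES can then sit behind an #if condition inside the template, or simply live in a separate PROD-specific template file handed to the generator.

import java.io.FileWriter;
import java.io.Writer;
import java.sql.Connection;
import java.sql.DriverManager;

import org.apache.metamodel.DataContext;
import org.apache.metamodel.DataContextFactory;
import org.apache.metamodel.schema.Table;
import org.apache.velocity.Template;
import org.apache.velocity.VelocityContext;
import org.apache.velocity.app.VelocityEngine;

public class MainDDLGenerator {
    public static void main(String[] args) throws Exception {
        String jdbcUrl      = args[0]; // e.g. jdbc:mysql://dbhost:3306/sales
        String user         = args[1];
        String password     = args[2];
        String templatePath = args[3]; // any Velocity template, e.g. hive-ddl-prod.vm (illustrative name)
        String outputPath   = args[4]; // file that will collect the generated Hive DDLs

        // MetaModel gives one schema-reading interface across Oracle, MySQL, SQL Server, etc.
        try (Connection connection = DriverManager.getConnection(jdbcUrl, user, password)) {
            DataContext dataContext = DataContextFactory.createJdbcDataContext(connection);

            VelocityEngine engine = new VelocityEngine();
            engine.init();
            Template template = engine.getTemplate(templatePath);

            try (Writer writer = new FileWriter(outputPath)) {
                for (Table table : dataContext.getDefaultSchema().getTables()) {
                    VelocityContext ctx = new VelocityContext();
                    ctx.put("tableName", table.getName());
                    ctx.put("columns", table.getColumns()); // the template maps each column's type to a Hive type
                    template.merge(ctx, writer);            // appends one CREATE TABLE statement per source table
                }
            }
        }
    }
}

Usage would then look something like "java MainDDLGenerator <jdbc-url> <user> <password> hive-ddl-prod.vm hive-ddls.sql", with a development/QA template that simply omits the PROD-only TBLPROPERTIES block.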


Beyond Datalake in Hadoop summits

Since 2013 I have attended a handful of Hadoop Summit/Strata conferences, and initially the trend was for speakers to highlight the importance of Hadoop and its technology stack.

When I last attended, in December 2014, I noticed a new trend: instead of talking more about the Hadoop core, speakers introduced the datalake and its architectural components, such as the Raw Data Zone, metadata management, search, and integration tools, to the community. I was pretty happy when Cloudera released the Hadoop Application Architectures book, and I recommended it to many people.

Just as Maven standardized project structure for Java developers, I feel the term "datalake" set a standard in the community by introducing a common vocabulary, such as Landing Zone and Raw Data Zone, among data engineers.

Being in consulting and having implemented datalakes at many enterprises, one question I was always asked was "What next… beyond the datalake?"

Though I did not attend any sessions last year (2015), based on the inputs I received from my colleagues I felt it is time for Hadoop Summit/Strata speakers to introduce the community to what lies beyond the datalake.

I think "domain-specific" data zones are what comes next after the datalake data zones; by "domain" I mean banking, financial services, healthcare, telecom, and so on. Many enterprises would already have answered the question "What comes after the datalake?" and implemented solutions, but such solutions might be locked away in source code repositories (like SVN) in the form of conceptual/solution architecture artifacts. One might find many videos or blogs on the kinds of solutions big data teams have developed beyond the datalake, but I feel that consolidating such solutions and introducing a reference architecture and vocabulary to the community through these summits would greatly help data engineers.

I might be wrong in some of the views expressed above, but I thought of sharing my opinion on "beyond the datalake".


Greenplum Sandbox VM and Cloudera Quickstart VM integration

Purpose of this integration:

I take this opportunity to share my experiences in integrating Greenplum Sandbox VM with Cloudera Quickstart VM.

The purpose of this integration is to create Greenplum external tables using Greenplum's gphdfs protocol and access data stored in HDFS.

While there are many articles and videos helping the community set up the Cloudera VM or the Greenplum VM, the focus of this blog post is on how to integrate the Cloudera Quickstart VM with the Greenplum Sandbox VM.

If you already know how to set up the Cloudera Quickstart VM and the Greenplum Sandbox VM, you can jump directly to Step 4 below, i.e. installing CDH manually in the Greenplum Sandbox VM.

What you need for this integration:

  • Oracle Virtual Box
  • Cloudera Quickstart VM
  • Greenplum Sandbox VM
  • TPC-DS dataset (you need Internet connection for this)
  • Understanding of how to configure Oracle VirtualBox's four network adapters

At the time of this article I used the following versions of the Greenplum Sandbox VM and the Cloudera Quickstart VM:

Greenplum VM – PivotalGPDB-Sandbox-Centos-6.7-x86_64-vmware
Cloudera VM – cloudera-quickstart-vm-5.4.2-0-virtualbox

Step 1: Attach Cloudera Quickstart VM to Oracle VirtualBox

  1. Download the latest Cloudera Quickstart VM from cloudera.com and attach it to Oracle VirtualBox
  2. Before you start the Cloudera Quickstart VM, right-click the VM in Oracle VirtualBox and click on Settings (see the image below)
  3. [Screenshot: VirtualBox Settings menu for the Cloudera Quickstart VM]
  4. In the Settings dialog select Network and configure all 4 network adapters
    1. Adapter 1 -> NAT
    2. Adapter 2 -> Host-only Adapter
    3. Adapter 3 -> Internal Network
    4. Adapter 4 -> Bridged Adapter (this might not work if you are trying this from your office, because it will request an IP address from your network router; if it does not work, skip this adapter)
  5. After applying these settings, start the Cloudera Quickstart VM in Oracle VirtualBox
  6. Now open a terminal within the Cloudera Quickstart VM and run the command ifconfig to find its IP addresses.
    1. In my case the IP address is 192.168.2.44 (that is the Adapter 4, i.e. Bridged Adapter, IP address)
      [Screenshot: ifconfig output in the Cloudera Quickstart VM]

Step 2: Attach Greenplum Sandbox VM to VMWare Fusion/Player

  1. Download the Greenplum Sandbox VM from the Greenplum database site: http://greenplum.org/
  2. Follow the steps detailed on the Greenplum site for attaching and starting the Greenplum Sandbox VM – https://network.pivotal.io/products/pivotal-gpdb#/releases/567/file_groups/337
  3. You don't need to set up network adapters for the Greenplum Sandbox VM, because data only needs to move (for the Greenplum external tables) from the Cloudera Quickstart VM to the Greenplum Sandbox VM, not the other way around.
    1. In other words, the Greenplum Sandbox VM only needs to be able to ping or reach the Cloudera Quickstart VM, so no extra network adapters are required on the Greenplum side.
  4. Start the Greenplum Sandbox VM in VMware and start the Greenplum database as detailed in the Pivotal GPDB release link above
  5. Once you log in as gpadmin in the Greenplum Sandbox VM and reach its command line, run ifconfig and note the VM's IP address
    1. In my case the Greenplum Sandbox VM's IP address is 192.168.221.168
    2. [Screenshot: ifconfig output in the Greenplum Sandbox VM]

Step 3: Install Oracle JDK in Greenplum Sandbox VM

I found that the Greenplum Sandbox VM uses JDK 1.6; since I am using a recent version of Cloudera CDH below, I decided to upgrade to JDK 1.8.


Follow the steps detailed in this link to install Oracle JDK 1.8 in the Greenplum Sandbox VM – http://tecadmin.net/install-java-8-on-centos-rhel-and-fedora/#

Note: Somehow the command mentioned in the "Setup PATH Variable" section of the above link (i.e. adding the JDK path to the PATH variable) did not work for me, so I did not make any PATH variable changes.

Step 4: Install CDH manually in Greenplum Sandbox VM

For the gphdfs protocol to communicate with the Cloudera Quickstart VM while querying external tables over data in HDFS, we first need to provide the location of a CDH installation. Since the Greenplum Sandbox VM does not have CDH installed, we need to follow the manual installation steps detailed on cloudera.com.

At the time of this writing I installed CDH 5.5 manually, following the steps as-is from this link: CDH 5.5 Manual Installation

Here are the high level steps I executed to install CDH 5.5 in Greenplum Sandbox VM:

NOTE: You need to execute these steps in the Greenplum Sandbox VM.

  • Download the CDH 5 "1-click Install" package using the "RHEL/CentOS/Oracle 6" link
  • Install the RPM (the step "Install the RPM for all RHEL versions")
  • Step 2: Optionally add a repository key – for RHEL/CentOS/Oracle 6 systems
  • Step 3: Install CDH 5 with YARN – install and deploy ZooKeeper
  • Step 4: Install each type of daemon package on the appropriate system(s) – RHEL/CentOS compatible
  • Step 4: Install CDH 5 with MRv1 – run only the "JobTracker host running: RHEL/CentOS compatible" part; I found that even the MRv1 jar files need to be installed, but do not install the other Hadoop components (NameNode etc.) for MRv1
  • Then jump straight to the Step 6 – Deploying CDH section. Specifically you need to go to this link

Step 5: Create gpadmin Linux user & HDFS home in Cloudera Quickstart VM (optional)

This is an optional step; I created the user just to ensure that the Greenplum Sandbox VM's Linux user (gpadmin), which accesses the HDFS data through Greenplum external tables, also has a Linux login in the Cloudera Quickstart VM.

In the Cloudera Quickstart VM execute these steps (the reason I prefixed every command with sudo is that I assume you are logged in as the cloudera Linux user in the Cloudera Quickstart VM):

  1. sudo groupadd gpadmin
  2. sudo useradd gpadmin -g gpadmin
  3. sudo passwd gpadmin
  4. <give a simple password for this login in the Cloudera Quickstart VM>
  5. Now create a HDFS home directory for this gpadmin user
  6. sudo -u hdfs hadoop fs -mkdir -p /user/gpadmin
  7. sudo -u hdfs hadoop fs -chmod -R 755 /user/gpadmin
  8. sudo -u hdfs hadoop fs -chown -R gpadmin:gpadmin /user/gpadmin

Step 6: Populate TPC-DS dataset in Cloudera Quickstart VM Hive

I have plans to run TPC-DS queries later (not in these VMs though 🙂), so just to test Greenplum external tables over data in HDFS I populated a small TPC-DS dataset (2 GB, generated in the steps below) in the Cloudera Quickstart VM.

You can download the scripts from the link below; in case the link is broken, just Google for "hive tpc-ds testbench": https://github.com/hortonworks/hive-testbench

This is what I did in Cloudera Quickstart VM:

  1. Login as cloudera user in Cloudera Quickstart VM
  2. Create a directory for TPC-DS Zip file and its build by running below commands
    1. sudo mkdir -p /usr/lib/tpcds
    2. sudo chmod -R 777 /usr/lib/tpcds
      1. (I understand 777 is not good practice, but this is just a VM and a learning exercise, so I relaxed the best practice here)
  3. Now login as gpadmin user by running this command
    1. su - gpadmin
    2. <Enter the gpadmin user password you have set above>
  4. Assuming you logged in as “gpadmin” user run below command
    1. cd /usr/lib/tpcds
  5. Run this command
    1. wget https://github.com/hortonworks/hive-testbench/archive/hive14.zip
    2. Somehow when I downloaded the above file it was saved as hive14 without any extension; to fix it I executed the command below
    3. mv hive14 hive14.zip
  6. Unzip the file i.e.
    1. unzip hive14.zip
  7. Then run these commands
    1. cd /usr/lib/hive-testbench-hive14

    2. ./tpcds-build.sh

    3. ./tpch-build.sh

  8. Now create a HDFS directory as a base path for TPCDS dataset
    1. Assuming you logged in as gpadmin Linux user in Cloudera Quickstart VM execute below command

      hadoop fs -mkdir -p /user/gpadmin/tpcds-data

  9. Now generate a 2 GB dataset (the TPC-DS scale factor must be greater than 1, which is why I am creating a 2 GB dataset)
    1. cd /usr/lib/hive-testbench-hive14
    2. hadoop fs -mkdir -p /user/gpadmin/tpcds-data

    3. hadoop fs -chmod -R 755 /user/gpadmin/tpcds-data
    4. ./tpcds-setup.sh 2 /user/gpadmin/tpcds-data
    5. <The above command will take at least 9 minutes to populate the dataset>
    6. <The same command will then try to optimize the data>
    7. <Don't wait until the optimization is complete; just watch for the console output below>
      1. TPC-DS text data generation complete.
      2. Loading text data into external tables.
      3. Optimizing table store_sales (1/24).
    8. <Once you see the above three lines of output, let the command keep running; you can proceed with the Greenplum external tables setup in the meantime>

Step 7: Know about TPC-DS Hive tables

After you run the above TPC-DS Hive testbench scripts, you will find a number of Hive tables created in the Cloudera Quickstart VM's Hive.

In your local browser enter this link http://192.168.2.44:8888/beeswax/#query

(You need to change the IP address above from 192.168.2.44 to whatever is appropriate for you. This IP address is the Cloudera Quickstart VM's VirtualBox Adapter 4, i.e. Bridged Adapter, address.)

Important Note: Database names are suffixed with the dataset size you specified in the tpcds-setup.sh script above. Since I specified 2 GB, the database created is tpcds_text_2; if you specified a different TPC-DS dataset size, the database name will end with that value instead.

For this blog post I am using the tpcds_text_2 database and the "reason" table; I selected "reason" because it has very few columns 🙂

[Screenshot: Hue Beeswax query editor showing the tpcds_text_2 tables]

Step 8: Creating External tables in Greenplum Database

Now log in to the Greenplum Sandbox VM as "gpadmin" and execute the commands below at the command line.

Note: Read the Greenplum Database documentation on how to log in to the Greenplum Sandbox VM from a command prompt. What I did was run "ssh gpadmin@192.168.221.168" at my command prompt and enter the password provided with the Greenplum Sandbox VM. Please note that 192.168.221.168 is the IP address assigned to this VM by my local VMware installation and may be different in your case.

Below are the commands/steps to create a database and external tables in Greenplum Sandbox VM –

  1. ssh gpadmin@192.168.221.168 -> you will land in the "gpadmin" home folder
  2. ~/start_all.sh <- this is to ensure Greenplum database is up and running
  3. createdb tpcds_data_db <- run this command at Linux command line. This will create a Greenplum database
  4. psql -U gpadmin tpcds_data_db <- Using Postgres command line interface login as gpadmin user and connect to tpcds_data_db database. Run this command at the Linux command line
  5. Now you will be in the psql CLI, directly connected to the tpcds_data_db database
  6. Now create Greenplum External table using below command
    1. CREATE EXTERNAL TABLE reason (r_reason_sk int, r_reason_id text, r_reason_desc text) LOCATION ('gphdfs://192.168.2.44:8020/user/gpadmin/tpcds-data/2/reason/*') FORMAT 'text' (delimiter '|') ENCODING 'UTF8';
  7. Observe that above we specified the NameNode IP address as 192.168.2.44
  8. Now execute a SELECT query on the Greenplum "reason" table by running the following in the psql command line client
    1. \connect tpcds_data_db;
    2. select * from reason;
    3. You will get an error message: "ERROR: external table gphdfs protocol command ended with error"
    4. Study the next step below to understand this error

[Screenshot: psql session creating the external table in tpcds_data_db]

[Screenshot: SELECT query failing with the gphdfs protocol error]

Step 9: Understanding the error shown in Step 8

The above error message is hard to interpret on its own. To understand it, one should know how an HDFS client reads an HDFS file, and I recommend studying the HDFS client read flow. For the purposes of this post, the key point is that when a client reads an HDFS file, the NameNode sends the client the address of the DataNode holding each block, and the client then connects directly to that DataNode to read the block.

In this context, the HDFS client is the gphdfs protocol.

When the Greenplum gphdfs protocol reads the HDFS file to execute the external table SELECT query, the Cloudera Quickstart VM's NameNode returns the DataNode address as 127.0.0.1 to the gphdfs client, so the Greenplum Database tries to read the HDFS data block from 127.0.0.1, i.e. from itself.
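As an aside, the same behaviour can be reproduced outside Greenplum with a plain HDFS client. Below is a minimal, illustrative sketch using the standard Hadoop Java API (assuming the CDH client libraries installed in Step 4 are on the classpath; the IP address and HDFS path are the ones used in this post). The flag it sets programmatically is the same one we add to hdfs-site.xml in the next step: without it the client would try the reported 127.0.0.1 address, with it the client connects to the DataNode by hostname.

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HostnameAwareHdfsRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://192.168.2.44:8020");
        // Connect to DataNodes by hostname instead of the IP address the NameNode reports.
        conf.setBoolean("dfs.client.use.datanode.hostname", true);

        try (FileSystem fs = FileSystem.get(conf)) {
            for (FileStatus file : fs.listStatus(new Path("/user/gpadmin/tpcds-data/2/reason"))) {
                if (!file.isFile() || file.getLen() == 0) {
                    continue; // skip sub-directories and empty marker files
                }
                try (BufferedReader reader = new BufferedReader(
                        new InputStreamReader(fs.open(file.getPath())))) {
                    System.out.println(reader.readLine()); // first '|'-delimited record
                }
                break; // reading one file is enough to confirm the DataNode connection works
            }
        }
    }
}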

To make the HDFS client use the hostname of the DataNode instead of its IP address, we need to add the following property to hdfs-site.xml in the Greenplum Sandbox VM.

Before we proceed, I assume you have installed the Cloudera CDH client manually as described in the steps above; if not, I recommend doing that first, otherwise the steps below will not work.

For more information read this link dfs.client.use.datanode.hostname changes

We need to update hdfs-site.xml with the following property; here are the steps (perform them in the Greenplum Sandbox VM):

  1. sudo vi /etc/hadoop/conf/hdfs-site.xml
  2. <property>
            <name>dfs.client.use.datanode.hostname</name>
            <value>true</value>
    </property>
    [Screenshot: hdfs-site.xml with the dfs.client.use.datanode.hostname property added]

Step 10: Re-executing the SELECT statement after hdfs-site.xml changes

Now connect to the Greenplum psql client and re-execute the SELECT statement. Here are the steps –

  1. psql -U gpadmin tpcds_data_db
  2. select * from reason;

SUCCESS!!!

You can see that Greenplum is able to read the HDFS data through its gphdfs protocol 🙂

[Screenshot: SELECT * FROM reason returning rows through the gphdfs protocol]

Step 11: Re-creating the Greenplum "reason" table

If you look at the console window above, although the Greenplum SELECT statement was able to read the HDFS data, it threw an error because each row in the HDFS "reason" data file ends with a trailing "|" character.

To overcome this problem we re-create the "reason" table with a dummy column at the end, so that the Greenplum database does not raise an error for the trailing delimiter.

Execute the commands below after connecting to the Greenplum psql command line interface:

  1. psql -U gpadmin tpcds_data_db
  2. drop external table reason;
  3. CREATE EXTERNAL TABLE reason (r_reason_sk int, r_reason_id text, r_reason_desc text, last_dummy_column text) LOCATION ('gphdfs://192.168.2.44:8020/user/gpadmin/tpcds-data/2/reason/*') FORMAT 'text' (delimiter '|') ENCODING 'UTF8';
  4. select * from reason;
  5. SUCCESS!!!

[Screenshot: SELECT on the re-created reason table succeeding]

With this we are done with Cloudera Quickstart VM and Greenplum Sandbox VM integration!

Hope you enjoyed the post 🙂