Purpose of this integration :
I take this opportunity to share my experiences in integrating Greenplum Sandbox VM with Cloudera Quickstart VM.
Purpose of this integration is to create Greenplum external tables using its gphdfs protocol and accessing the data in HDFS.
While there are many articles/videos helping community on how to setup Cloudera VM or Greenplum VM this blog post importance is on how to integrate Cloudera Quickstart VM with Greenplum Sandbox VM.
If you know how to setup Cloudera Quickstart VM and Greenplum Sandbox VM then you can directly to Step 4 below i.e. installing CDH manually in Greenplum Sandbox VM.
What you need for this integration:
- Oracle Virtual Box
- Cloudera Quickstart VM
- Greenplum Sandbox VM
- TPC-DS dataset (you need Internet connection for this)
- Understanding of Oracle Virutalbox 4 network adapters configuration
At the time of this article I used following versions of Greenplum Sandbox and Cloudera Quickstart VM :-
Greenplum VM – PivotalGPDB-Sandbox-Centos-6.7-x86_64-vmware
Cloudera VM – cloudera-quickstart-vm-5.4.2-0-virtualbox
Step 1: Attach Cloudera Quickstart VM to Oracle Virtual Box
- Download latest Cloudera Quickstart VM from cloudera.com site and attach it to Oracle VirtualBox
- Before your start Cloudera Quickstart VM right click on that VM in Oracle Virtualbox and click on Settings (check below image)
- In settings context menu select Network and add all 4 network adapters
- Adapter 1 -> NAT
- Adapter 2 -> Host-only Adapter
- Adapter 3 -> Internal Network
- Adapter 4 -> Bridge Adapter (this might not work if you are trying from your office because this will try to get an IP Address from your network router. If it does not work then do not set this up)
- After above settings Start the Cloudera Quickstart VM in Oracle Virtual Box
- Now click on terminal/shell icon within the Cloudera Quickstart Vm and type the command ifconfig to know its IP addresses.
- In my case its IP address is 192.168.2.44 (I mean Adapter 4 IP address i.e. Bridge Adapter IP address)
Step 2: Attach Greenplum Sandbox VM to VMWare Fusion/Player
- Download Greenplum Sandbox VM from Greeplum database site. http://greenplum.org/
- Follow steps detailed in Greeplum site on how to attach Greenplum Sandbox VM and start the Greenplum Sandbox VM – https://network.pivotal.io/products/pivotal-gpdb#/releases/567/file_groups/337
- You don’t need to setup network adapters for this Greenplum Sandbox VM because we need data movement (for Greenplum external tables) from Cloudera Quickstart VM to Greenplum Sandbox VM and not the other way.
- This means Greenplum Sandbox VM should be able to ping or access the Cloudera Quickstart VM. For this reason we don’t need to specify the network adapters for Greenplum Sandbox VM
- Start the Greenplum Sandbox VM in VMWare and also start the Greenplum database as detailed in above Pivotal GPDB release link
- Once you login as gpadmin in Greenplum Sandbox VM and you see command line in Greenplum Sandbox VM type the ifconfig command and note the IP address of Greenplum Sandbox VM
- In my case its (i.e. Greenplum Sandbox VM) IP address is 192.168.221.168
Step 3: Install Oracle JDK in Greenplum Sandbox VM
I found that Greenplum Sandbox VM uses JDK 1.6 since I am using latest version of Cloudera CDH below I decided to upgrade to JDK 1.8.
Follow the steps detailed in this link on how to install Oracle JDK 1.8 in Greenplum Sandbox VM – http://tecadmin.net/install-java-8-on-centos-rhel-and-fedora/#
Note: Somehow when I executed the command mentioned in “Setup PATH Variable” in above link i.e. setting the JDK path in PATH variable. For this reason I did not executed or made PATH variable changes.
Step 4: Install CDH manually in Greenplum Sandbox VM
For gphdfs protocol to communicate with Cloudera Quickstart VM while querying the external tables with data in HDFS we need to first provide location of CDH installation. Since Greenplum Sandbox VM does not have CDH installed we need to follow the Manual installation steps detailed in cloudera.com
At the time of this writing I installed CDH 5.5 manually and followed the steps (manual installation) as it is from this link CDH 5.5 Manual Installation
Here are the high level steps I executed to install CDH 5.5 in Greenplum Sandbox VM:
NOTE: You need to install / execute these steps in Greenplum Sandbox VM.
- Downloaded the link from “To download and install the CDH 5 “1-click Install” package: RHEL/CentOS/Oracle 6 link”
- Install the RPM for all RHEL versions:
- Step 2: Optionally Add a Repository Key – For RHEL/CentOS/Oracle 6 systems:
- Step 3: Install CDH 5 with YARN – Install and deploy ZooKeeper.
- Step 4: Install each type of daemon package on the appropriate systems(s), as follows. – RHEL/CentOS compatible
- Step 4: Install CDH 5 with MRv1 – you need to run this and only run the “JobTracker host running: RHEL/CentOS compatible” – I found I need to install even MRV1 jar files. Do not run other Hadoop components like NameNode etc for MRV1
- Then you need immediately jump to Step 6 – Deploying CDH section. Specifically you need to go to this link
Step 5: Create gpadmin Linux user & HDFS home in Cloudera Quickstart VM (optional)
This is an optional step, I just created to ensure Greenplum Sandbox VM Linux user (gpadmin) accessing the HDFS data through Greenplum External tables has a Linux login even in Cloudera Quickstart VM
In Cloudera Quickstart VM execute these steps (reason why I mentioned sudo in the beginning of every command is – assuming you are in cloudera Linux user in Cloudera Quickstart VM)
- sudo groupadd gpadmin
- sudo useradd gpadmin -g gpadmin
- sudo passwd gpadmin
- <give a simple password for this login in Cloudera Quickstart VM)
- Now create a HDFS home directory for this gpadmin user
- sudo -u hdfs hadoop fs -mkdir -p /user/gpadmin
- sudo -u hdfs hadoop fs -chmod -R 755 /user/gpadmin
- sudo -u hdfs hadoop fs -chown -R gpadmin:gpadmin /user/gpadmin
Step 6: Populate TPC-DS dataset in Cloudera Quickstart VM Hive
I have plans to run TPC-DS queries later (not in these VM’s though 🙂 ) so just to test Greenplum external tables with data in HDFS I populated TPC-DS 1 GB dataset in Cloudera Quickstart VM.
You can download the scripts from below link. In case below link is broken just Google for “hive tpc-ds testbench” https://github.com/hortonworks/hive-testbench
This is what I did in Cloudera Quickstart VM:
- Login as cloudera user in Cloudera Quickstart VM
- Create a directory for TPC-DS Zip file and its build by running below commands
- sudo mkdir -p /usr/lib/tpcds
- sudo chmod -R 777 /usr/lib/tpcds
- (I understand 777 is not a good practice but this is just a VM and learning exercise so I relaxed on best practice)
- Now login as gpadmin user by running this command
- su – gpadmin
- <Enter the gpadmin user password you have set above>
- Assuming you logged in as “gpadmin” user run below command
- cd /usr/lib/tpcds
- Run this command
- wget https://github.com/hortonworks/hive-testbench/archive/hive14.zip
- Somehow when I downloaded above file it stored as hive4 file without any extension, to fix it I executed below command
- mv hive14 hive14.zip
- Unzip the file i.e.
- unzip hive14.zip
- Then run these commands
- Now create a HDFS directory as a base path for TPCDS dataset
- Assuming you logged in as gpadmin Linux user in Cloudera Quickstart VM execute below command
hadoop fs -mkdir -p /user/gpadmin/tpcds-data
- Now generate a 2GB dataset ( I found TPC-DS scale-factor to be more than 1 for this reason creating 2GB dataset)
- cd /usr/lib/hive-testbench-hive14
hadoop fs -mkdir -p /user/gpadmin/tpcds-data
- hadoop fs -chmod -R 755 /user/gpadmin/tpcds-data
- ./tpcds-setup 2 /user/gpadmin/tpcds-data
- <Above command will take at least 9 minutes to populate dataset>
- <Same command will try to optimize the data>
- <Don’t want until optimization is complete. Just observer for below console output>
- TPC-DS text data generation complete.
- Loading text data into external tables.
- Optimizing table store_sales (1/24).
- <Once you see the above 3 lines of code let the command continue but you can continue with Greenplum External tables setup step>
Step 7: Know about TPC-DS Hive tables
After you run above TPC-DS Hive testbench scripts you will find few Hive tables created in Cloudera Quickstart VM Hive.
In your local browser enter this link http://192.168.2.44:8888/beeswax/#query
(You need to change the IP address above from 192.168.2.44 to whatever is appropriate for you. This IP address is Cloudera Quickstart VM’s Virtual Box Adapter 4 Network address)
Important Note: Database names are prefixed with dataset total size you specified in the above tpcds-setup.sh shell script. Since I specified 2GB the database name created is tpcds_text_2, if you specified other tpc-ds dataset total size value then database name will be prefixed with that value.
For this blog post I am considering tpcds_text_2 database and “reason” table. Since “reason” table has very few columns I selected this table 🙂
Step 8: Creating External tables in Greenplum Database
Now login to Greenplum Sandbox VM using “gpadmin” login and execute below commands at command line.
Note: Read Greenplum Database documentation on how to login to Greenplum Sandbox VM from command prompt. This is what I did “ssh firstname.lastname@example.org” at my command prompt and enter the password provided in Greenplum Sandbox VM. Please note that 192.168.221.168 is the IP address assigned by local VMWare to this VM and this might be different in your case.)
Below are the commands/steps to create a database and external tables in Greenplum Sandbox VM –
- ssh email@example.com -> you will be landed into “gpadmin” home folder
- ~/start_all.sh <- this is to ensure Greenplum database is up and running
- createdb tpcds_data_db <- run this command at Linux command line. This will create a Greenplum database
- psql -U gpadmin tpcds_data_db <- Using Postgres command line interface login as gpadmin user and connect to tpcds_data_db database. Run this command at the Linux command line
- Now you will be in Postgresh CLI interface and directly connected to tpcds_data_db database
- Now create Greenplum External table using below command
- CREATE EXTERNAL TABLE reason ( r_reason_sk int,r_reason_id text, r_reason_desc text) LOCATION(‘gphdfs://192.168.2.44:8020/user/gpadmin/tpcds-data/2/reason/*’) FORMAT ‘text’ (delimiter ‘|’) ENCODING ‘UTF8’;
- Observe above we specified NameNode IP Address as “192.168.2.44”
- Now execute a SELECT query on Greenplum DB “reason” by executing following query in Postgres command line client
- \connect tpcds_data_db;
- select * from reason;
- You will get an error message “ERROR: external table gphdfs protocol command ended with error.“
- Study next “Step” below to understand this error
Step 9: Understanding the error shown in Step 8
It is very hard to understand the above error message. To understand the error one should know how HDFS client reads a HDFS file. I recommend reader of this blog post to study the HDFS client read flow. For the context of this post I just highlight that when a client reads a HDFS file NameNode will send to the client IP address of Data Node where HDFS file block exists, using that Data Node IP client will connect directly to the Data Node and read that block.
In this context HDFS client is gphdfs protocol.
When Greenplum Database external gphdfs protocol reads the HDFS file to execute external table SELECT query Cloudera Quickstart VM’s NameNode is returning Data Node IP address as 127.0.0.1 to Greenplum database gphfs protocol client. So Greenplum Database is trying to read the HDFS data block from this 127.0.0.1 IP address.
To let HDFS client not to use IP address of Data Node instead use the host name of Data Node we need to add following property in Greenplum Sandbox VM hdfs-site.xml
Before we proceed I assume you installed the Cloudera CDH client manually as described in above steps, if not I recommend to do that otherwise below steps will not work.
For more information read this link dfs.client.use.datanode.hostname changes
We need to update the hdfs-site.xml with following property and here are the steps (you need to do these steps in Greenplum Sandbox VM)
- sudo vi /etc/hadoop/conf/hdfs-site.xml
Step 10: Re-executing the SELECT statement after hdfs-site.xml changes
Now connect to Greenplum psql client and re-execute the SELECT statement. Here are the steps to do it –
- psql -u gpadmin tpcds_data_db
- select * from reason
You can see Greenplum able to read the HDFS data through its gphdfs protocol 🙂
Step 10: Re-creating the Greenplum “reason” table
If you observe in the above console window though the Greeplum database SELECT statement able to read the HDFS data it throwed an error because in HDFS “reason” data file last column is suffixed with “|” character.
To overcome this problem we need to re-create the “reason” table with a dummy column in the end so that Greenplum database will not get any error
Execute below commands by connecting to Greenplum psql command line interface
- psql -U gpadmin tpcds_data_db
- drop external table reason;
- CREATE EXTERNAL TABLE reason ( r_reason_sk int,r_reason_id text, r_reason_desc text, last_dummy_column text) LOCATION(‘gphdfs://192.168.2.44:8020/user/gpadmin/tpcds-data/2/reason/*’) FORMAT ‘text’ (delimiter ‘|’) ENCODING ‘UTF8’;
- select * from reason
With this we are done with Cloudera Quickstart VM and Greenplum Sandbox VM integration!
Hope you enjoyed the post 🙂