How to Deploy MapD Community Edition on OpenShift Origin
OpenShift is Red Hat’s enterprise-grade container platform based on Docker, Kubernetes and Red Hat Enterprise Linux. Origin is the upstream, open source, community version of OpenShift, so pairing it with the MapD Community Edition feels like a natural fit!
In this blog post, I will show you how to deploy MapD on top of OpenShift Origin. The MapD Community Edition is available as Docker images for two hardware configurations: one for CPU-only nodes and the other for GPU-enabled nodes. For this example, I used shared storage so that the same backend MapD database can be served by either the CPU or the GPU node with its corresponding Docker image. The ability to switch between a CPU and a GPU node lets us use the CPU image for tasks that do not demand much performance, and the GPU image for running SQL queries and rendering complicated charts in milliseconds.
Setting up the infrastructure
For testing MapD on OpenShift Origin, I used Amazon EC2 cloud, but the instructions are also applicable to an on-premise deployment of OpenShift. I followed the instructions in Sysdig’s tutorial How to deploy OpenShift on AWS to set up a minimalistic OpenShift Origin cluster. I modified the CloudFormation script provided in this tutorial to launch the following configuration in AWS:
- 1 t2.medium CentOS 7 based master node
- 1 p2.xlarge CentOS 7 based worker node that has a GPU (Tesla-K80)
- 1 t2.large CentOS 7 based worker node that has CPU only
- A separate VPC and subnets to provide logical network isolation for the OpenShift cluster
- Security groups that will open the following public ports:
- 22 SSH for all hosts
- 8443 for the OpenShift Web console, master node
- 10250 master proxy to node hosts, master node
- 9090 through 9092 for MapD
In a production environment, the amount of data you can process with MapD Core depends primarily on the amount of GPU RAM and CPU RAM available on a MapD worker node. The MapD Hardware Configuration Reference Guide will help you size the MapD node for both on-premise as well as cloud deployment.
As I want to access the same database from both the CPU- and GPU-based worker nodes, I decided to set up shared storage based on Amazon EFS (Elastic File System). EFS allows the nodes to mount the file system using the NFSv4 protocol. I created the EFS file system mount target in the same VPC as the OpenShift cluster.
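To confirm that a node can reach the mount target, you can mount the file system manually; a quick check along these lines, using the mount target DNS name from my setup (yours will differ):
$ sudo mkdir -p /mapd-storage
$ sudo mount -t nfs4 -o nfsvers=4.1 fs-2bc69063.efs.us-east-1.amazonaws.com:/ /mapd-storage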
Deploying OpenShift Using Ansible
The OpenShift Origin project uses Ansible playbooks to automate the installation. I used the OpenShift master node as the Ansible server which needs to be set up with several required packages.
Log in to the master node and install Ansible and Git:
$ sudo yum install epel-release -y
$ sudo yum update -y
$ sudo yum install ansible -y
$ sudo yum install git -y
I used the OpenShift Ansible GitHub project version 3.9, which contains the Ansible roles and playbooks to install and manage OpenShift clusters.
$ git clone https://github.com/openshift/openshift-ansible.git
$ cd openshift-ansible
$ git checkout release-3.9
$ cd ..
On the master node, I also downloaded the prepare.yml playbook and the hosts inventory file from the Sysdig tutorial project mentioned above. The prepare.yml playbook performs some of the pre-configuration of the CentOS hosts, and Ansible uses the hosts inventory file to set the names of the master and worker nodes correctly. In this setup, the master node runs all of the components of the master controller and also hosts the etcd key-value store.
As you can see, the hosts file has the same server for both the master and etcd:
...
[masters]
ip-10-0-0-6.ec2.internal
[etcd]
ip-10-0-0-6.ec2.internal
[nodes]
ip-10-0-0-6.ec2.internal openshift_node_labels="{'region':'infra','zone':'east'}" openshift_schedulable=true
ip-10-0-0-4.ec2.internal openshift_node_labels="{'region': 'primary', 'zone': 'east'}"
ip-10-0-0-10.ec2.internal openshift_node_labels="{'region': 'primary', 'zone': 'east'}"
I ran the pre-configuration Ansible playbook:
$ ansible-playbook prepare.yml -i ./hosts --key-file mapd-east1.pem
Then, I applied the OpenShift installation playbook to the configured nodes.
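In release-3.9 of openshift-ansible, the installation is driven by the deploy_cluster.yml playbook; the invocation looked roughly like this:
$ ansible-playbook openshift-ansible/playbooks/deploy_cluster.yml -i ./hosts --key-file mapd-east1.pem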
The OpenShift cluster became available once the playbook completed execution successfully, and I could then access the web console.
The final step in setting up the cluster was to create a user account and password:
$ sudo htpasswd -b /etc/openshift/openshift-passwd admin MapD1@
With a browser, I accessed the OpenShift console at the master node’s public IP address on port 8443 and logged in as admin with the password MapD1@.
On the master node (ip-10-0-0-6), I logged in as admin into the default project using the oc command line utility, and confirmed that the master and worker nodes were in a ready state:
$ oc login -u system:admin
$ oc project default
$ oc get nodes
NAME                        STATUS    ROLES     AGE       VERSION
ip-10-0-0-10.ec2.internal   Ready     compute   13d       v1.9.1+a0ce1bc657
ip-10-0-0-4.ec2.internal    Ready     compute   13d       v1.9.1+a0ce1bc657
ip-10-0-0-6.ec2.internal    Ready     master    13d       v1.9.1+a0ce1bc657
In my setup, ip-10-0-0-10 is the GPU node and ip-10-0-0-4 is the CPU-only node, with the nodes labeled accordingly:
$ oc label node ip-10-0-0-4.ec2.internal nodetype=cpu
$ oc label node ip-10-0-0-10.ec2.internal nodetype=gpu
GPU Setup - NVIDIA Driver Installation
To use the Tesla-K80 GPU on the OpenShift node, I had to install the NVIDIA drivers and NVIDIA docker container runtime modules. The operating system on the node with the GPU is CentOS Linux release 7.3.1611 (Core). I installed the Extra Packages for Enterprise Linux (EPEL) repository, because RHEL-based distributions require Dynamic Kernel Module Support (DKMS) to build the GPU driver kernel modules.
I performed the following commands on the GPU node:
$ sudo yum install epel-release
Then I installed the latest kernel for which the headers are available and rebooted the node:
$ sudo yum update -y kernel
$ sudo reboot
After reboot, I installed the kernel headers and CUDA drivers. The CUDA platform gives direct access to the GPU virtual instruction set and parallel computation elements.
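The commands were along these lines; the CUDA repository RPM file name is a placeholder, as its version varies with the release you download from NVIDIA:
$ sudo yum install -y kernel-devel-$(uname -r) kernel-headers-$(uname -r)
$ sudo rpm -i cuda-repo-rhel7-*.rpm
$ sudo yum clean all
$ sudo yum install -y cuda-drivers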
I labeled the installed NVIDIA files with the correct SELinux label and rebooted the system to ensure that all changes were active.
$ sudo chcon -t container_file_t /dev/nvidia*   # this alone did not work; a reboot was also required
$ sudo reboot
After reboot, I confirmed that the NVIDIA drivers were loaded and that the GPU was recognized.
$ lsmod | grep nvidia
$ nvidia-smi
Deploying the GPU Version of MapD Community Edition
These are the steps for installing MapD Community Edition as a Docker container on the OpenShift node running on a Tesla-K80 GPU. The image mapd/mapd-ce-cuda is optimized to run on the CUDA platform and is available on the Docker hub. I created the YAML file deploy_mapd_gpu.yml for launching the pod with the MapD docker image:
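The file itself is not reproduced in this post, but it mirrors the CPU pod definition shown later, swapping in the CUDA image and the GPU node label; a sketch, assuming the nodetype=gpu label applied earlier and the EFS mount target from my setup:
apiVersion: v1
kind: Pod
metadata:
  name: mapd
  labels:
    app: mapd
spec:
  nodeSelector:
    # Node with GPU
    nodetype: gpu
  containers:
  - name: mapd
    # MapD docker CUDA image
    image: mapd/mapd-ce-cuda
    volumeMounts:
    - mountPath: "/mapd-storage"
      name: mystor
    ports:
    - name: mapd-port0
      containerPort: 9090
    - name: mapd-port1
      containerPort: 9091
    - name: mapd-port2
      containerPort: 9092
  volumes:
  - name: mystor
    nfs:
      server: fs-2bc69063.efs.us-east-1.amazonaws.com
      path: /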
After I created the MapD pod, I checked the status to make sure it was running:
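The standard oc workflow applies here:
$ oc create -f deploy_mapd_gpu.yml
$ oc get pods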
I then got details of the launched MapD pod and ran the nvidia-smi management and monitoring command line utility to make sure that the MapD process was running on the GPU:
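Something along these lines, with nvidia-smi executed inside the running container:
$ oc describe pod mapd
$ oc exec mapd -- nvidia-smi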
I accessed the command line in the MapD docker image to examine the processes and ran MapD utilities. Notice that the MapD server and MapD web server were launched automatically and the database was stored on the NFS shared storage mounted at /mapd-storage.
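A shell in the container shows the server processes; the exact process names can differ between MapD releases:
$ oc exec -it mapd -- /bin/bash
$ ps -ef | grep mapd
$ ls /mapd-storage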
To verify that the system was working, I loaded some sample data, ran a SQL query using mapdql command line utility, and finally generated a scatter plot using MapD Immerse. MapD ships with two sample datasets of airline flight information collected in 2008, and one dataset of New York City census information collected in 2015. To install the sample data, I ran the insert_sample_data command and chose option 2 for inserting 10k rows of Flights data.
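From a shell inside the container, I ran the script from the MapD installation directory (its location may vary by image):
$ ./insert_sample_data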
I connected to MapD Core using mapdql, using the default password “HyperInteractive”:
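Assuming the default mapd user and database, and running from the MapD installation directory inside the container:
$ bin/mapdql mapd -u mapd -p HyperInteractive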
Then I listed the tables in the database:
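mapdql uses backslash commands; \t lists the tables in the current database:
mapdql> \t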
I printed the schema for the flights table to see the fields used in the SQL query:
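mapdql> \d flights_2008_10k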
I then ran a SQL query on the table to find flights where the distance between the origin and destination cities were less than 175 miles:
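A query along these lines does the job; the column names match MapD's sample flights dataset:
mapdql> SELECT origin_city AS "Origin", dest_city AS "Destination", AVG(airtime) AS "Average Airtime" FROM flights_2008_10k WHERE distance < 175 GROUP BY origin_city, dest_city;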
In order to access MapD Immerse through the web interface, I started an OpenShift Service with a NodePort attached to the MapD pod’s port 9092 (Immerse). OpenShift will transparently route incoming traffic on the NodePort to the service irrespective of which node you connect to, even if the pod is not running on that node. The service is defined in mapd_service.yml:
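The original file contents are only approximated here; a minimal definition that selects the MapD pod by its app: mapd label and pins Immerse's port 9092 to NodePort 30092 (the port used below) would look like this:
apiVersion: v1
kind: Service
metadata:
  name: mapd-service
spec:
  type: NodePort
  selector:
    app: mapd
  ports:
  - name: mapd-port2
    port: 9092
    nodePort: 30092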
I created a MapD service and confirmed that it is available:
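Again with the standard oc commands:
$ oc create -f mapd_service.yml
$ oc get services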
With a web browser connected to Immerse, using one of the nodes’ external (public) IP addresses on port 30092 (http://54.164.113.161:30092), I created a dashboard with a scatter plot on the newly ingested flights_2008_10k table, following these steps:
- Clicked DASHBOARDS -> New Dashboard -> Add Chart -> SCATTER
- Clicked SOURCES, chose the flights_2008_10k table as the data source
- Clicked MEASURES X Axis -> + Add Measure, chose depdelay
- Clicked MEASURES Y Axis -> + Add Measure, chose arrdelay
The resulting chart shows, unsurprisingly, that there is a correlation between departure delay and arrival delay. Finally, by clicking Apply and Save, I created the dashboard with the name Flights Dashboard.
Before I switched over to the CPU-only version of the MapD Docker image, I deleted the MapD service and pod.
$ oc delete service mapd-service
service "mapd-service" deleted
$ oc delete pod mapd
pod "mapd" deleted
Deploying the CPU Version of MapD Community Edition
To deploy the CPU-only version of the MapD Community Edition, I used the pod definition YAML file deploy_mapd_cpu.yml for installing MapD Community Edition as a Docker container on the CPU-only OpenShift node. Notice that the main differences from the GPU version are the Docker image and the nodetype selector. The CPU version of the MapD pod uses the same EFS shared storage, so it comes up pointing to the same database that I worked on with the GPU pod.
apiVersion: v1
kind: Pod
metadata:
  name: mapd
  labels:
    app: mapd
spec:
  nodeSelector:
    # Node with CPU
    nodetype: cpu
  containers:
  - name: mapd
    # MapD docker CPU image
    image: mapd/mapd-ce-cpu
    volumeMounts:
    - mountPath: "/mapd-storage"
      name: mystor
    ports:
    - name: mapd-port0
      containerPort: 9090
    - name: mapd-port1
      containerPort: 9091
    - name: mapd-port2
      containerPort: 9092
  volumes:
  - name: mystor
    nfs:
      server: fs-2bc69063.efs.us-east-1.amazonaws.com
      path: /
The service YAML file is the same one I used for the GPU pod, so I launched the MapD pod and service:
$ oc create -f deploy_mapd_cpu.yml
$ oc create -f mapd_service.yml
After confirming that the pod was running, I opened the browser to connect to Immerse using one of the nodes’ external (public) IP addresses on port 30092. Because the CPU pod accesses the same database that was created by the GPU pod, I was able to open the same “Flights Dashboard” that I created earlier. Note that in the CPU-only configuration MapD disables backend rendering, which limits access to the compute-intensive chart types.
Conclusion
OpenShift is the leading open source container management platform, with advanced features and a large community of developers and users. MapD is also emerging as an indispensable open source analytics tool for data scientists, providing unprecedented interactivity with massive datasets. MapD on OpenShift makes deployment, management, and scaling easy across different infrastructures, whether physical, virtual, public cloud, private cloud, or hybrid cloud.