Provision instances from Google Cloud


In this section we will set up the tools needed to create virtual machines on Google Cloud Platform (GCP).

  • Setup Ubuntu using Windows Subsystem
  • Sign up for GCP
  • Create template for Big Data Server
  • Provision Servers for Big Data Cluster
  • Review Concepts
  • Setting up gcloud
  • Setup ansible on first server
  • Format JBOD
  • Cluster Topology

By the end of this section you should have 8 servers on which you can set up a multi-node Big Data cluster.

Setup Ubuntu using Windows Subsystem

It is better to have a Linux-based environment for setting up Big Data clusters. It provides the following capabilities.

  • Ability to connect to remote servers using ssh
  • Copy files using scp or rsync
  • Use a proxy such as sshuttle, which authenticates over ssh, to access web applications running on servers behind a firewall without opening ports to the public.

Setup Process

Here are the instructions for setting up Ubuntu using the Windows Subsystem for Linux.

  • Open PowerShell as Administrator and run this command – Enable-WindowsOptionalFeature -Online -FeatureName Microsoft-Windows-Subsystem-Linux
  • Go to the Windows Store, search for Ubuntu and install it
  • Click on Launch and complete the setup process by providing a username and password.
  • Install sshuttle using apt-get – sudo apt-get install sshuttle -y
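As a rough sketch of the last step and of how the proxy might be used afterwards, the snippet below prints the install commands and an sshuttle invocation that routes one subnet through a server. The `user@EXTERNAL_IP` placeholder and the `10.142.0.0/20` subnet are assumptions for illustration; substitute the values of your own network. The commands are printed rather than executed so they can be reviewed first.

```shell
# Print the commands that install sshuttle inside the Ubuntu (WSL) shell.
setup_cmds() {
  echo "sudo apt-get update"
  echo "sudo apt-get install -y sshuttle"
}

# Print an sshuttle command that tunnels one subnet through a remote server.
# $1 = user@host (placeholder), $2 = subnet in CIDR notation (assumed value).
sshuttle_cmd() {
  echo "sshuttle -r $1 $2"
}

setup_cmds
sshuttle_cmd "user@EXTERNAL_IP" "10.142.0.0/20"
```

While sshuttle is running, traffic to the given subnet goes through the ssh connection, so web interfaces on the servers can be reached from a local browser without extra firewall rules.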

Sign up for GCP

Let us sign up for GCP and start using it to set up the clusters. We will use GCP extensively in the demonstrations.

  • Home Page:
  • $300 credit for a year. This credit is enough to learn the setup process for most of the clusters.
  • We can stop the servers and start them again whenever we need. This helps us stretch the credit for our learning purposes.
  • The setup process takes only a few minutes, after which we can start provisioning servers (virtual machines).
  • Make sure to upgrade the account to use the $300 credit
  • Go to Quotas and increase the CPUs quota of the Compute Engine API to 48 in the region of your choice. In our case it is us-east1.
    • Choose the metric in desired region
    • Enter phone number and click on Next
    • Enter quota to be 48 and click on Save.
    • It might take 2 days to increase the quota.
  • Pricing – let us go through the details related to pricing.
    • There are variable costs and fixed costs associated with GCP
    • VM Instances are charged as you use them
    • Once provisioned, there is a fixed cost for storage as well as static IPs.
    • You can use the GCP pricing calculator to get an idea of the costs.
    • Once the $300 credit is exhausted, you or your company have to pay for further usage.
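For the quota step above, a quick back-of-the-envelope check can show whether the planned cluster fits. This is only a sketch: the figure of 24 for the current regional quota is an assumption, so replace it with the value shown on your Quotas page.

```shell
# Check whether the planned cluster fits within a regional CPU quota.
servers=8            # servers planned in this section
vcpus_per_server=2   # 2 vCPU machine type used for each server
quota=24             # assumed current regional CPU quota

needed=$((servers * vcpus_per_server))
echo "vCPUs needed: $needed, quota: $quota"
if [ "$needed" -le "$quota" ]; then
  echo "fits within the quota"
else
  echo "request a quota increase"
fi
```

Once gcloud is set up, `gcloud compute regions describe us-east1` lists the quotas and current usage for the region.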

Make sure to set up the Cloud Console mobile app on your iPhone or Android device so that you can monitor billing and manage virtual machines.

Create template for Big Data Server

Let us create a template for the Big Data Server. We will use this template to provision the servers for our Big Data Cluster.

  • First check the quotas for CPUs. Use 2 vCPUs with 8 GB of memory. You can use a higher configuration if the quota is higher than 24 CPUs in any one region.
  • You also need to consider other services running under your account, since they count against the same quota.
  • Click on Instance Templates
  • Name: bigdataserver
  • Machine Type: 2 vCPUs
  • Boot disk: Select CentOS 7 and increase the size to 32 GB
  • Rest: Leave defaults

We can use this template to provision as many nodes as we want for our Big Data Cluster.
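The same template could probably also be created from the command line once gcloud is set up. The sketch below prints one possible command. The machine type `e2-standard-2` (2 vCPUs, 8 GB) and the `centos-7` image family in the `centos-cloud` project are assumptions, not choices stated in this section, and that image family may no longer be offered.

```shell
# Print a gcloud command that creates the instance template described above.
# Assumptions: e2-standard-2 as the 2 vCPU / 8 GB machine type, and the
# centos-7 image family in the centos-cloud project being available.
template_cmd() {
  echo "gcloud compute instance-templates create bigdataserver" \
       "--machine-type=e2-standard-2" \
       "--image-family=centos-7 --image-project=centos-cloud" \
       "--boot-disk-size=32GB"
}

template_cmd
```

Printing the command first makes it easy to compare the flags with the values entered in the console before running anything.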

Provision Servers for Big Data Cluster

Now let us provision the first server for our Big Data Cluster.

  • Go to Instance Templates
  • Click on more options and then choose Create VM
  • Name: bigdataserver-1
  • Click on Create

Once the server is provisioned, we need to add additional disks for storage.

  • Click on bigdataserver-1
  • Click on Edit
  • Click on Add Item under Additional disks
  • Click on Create Disk
    • Name: bigdataserver-1-disk-1
    • Choose Blank Disk from Source type
    • Size: 60 (GB)
    • Click on Create
  • Now click on Save on the Edit page.

Repeat these steps 7 more times, naming the servers bigdataserver-2 through bigdataserver-8, so that we have 8 servers, each with one additional 60 GB disk.
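Repeating the console steps by hand is tedious, so here is a sketch of how the same result might be scripted with gcloud. It only prints the commands. The zone `us-east1-b` is an assumption, and the template name `bigdataserver` is taken from the previous section.

```shell
# Print the gcloud commands that create 8 servers from the template and
# attach one blank 60 GB disk to each of them.
ZONE="us-east1-b"   # assumed zone inside us-east1

provision_cmds() {
  i=1
  while [ "$i" -le 8 ]; do
    name="bigdataserver-$i"
    echo "gcloud compute instances create ${name} --source-instance-template=bigdataserver --zone=${ZONE}"
    echo "gcloud compute disks create ${name}-disk-1 --size=60GB --zone=${ZONE}"
    echo "gcloud compute instances attach-disk ${name} --disk=${name}-disk-1 --zone=${ZONE}"
    i=$((i + 1))
  done
}

provision_cmds
```

After reviewing the output, `provision_cmds | sh` would execute the commands one after the other.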

Review Concepts

Let us review some of the important concepts in Google Cloud Platform.

  • Instance Templates
  • Instance Groups
  • VM Instances
  • Boot Disks
  • Additional Storage or Disks
  • Internal IP and External IP
  • SSH Options – gcloud and regular SSH
  • Instance states – Start, Stop, Reset, Delete
  • Firewall Rules

Setting up gcloud

Once the servers are created, we can connect to them from our terminal using gcloud and ssh.

  • Refer to the Google Cloud SDK installation instructions to set up gcloud
  • We will use the Debian/Ubuntu instructions (inside Ubuntu on the Windows Subsystem).
  • Follow the instructions on the page. No need to install additional packages.
  • Make sure to run gcloud init
  • Use the gcloud command for one of the hosts to connect to it. This generates a new key named google_compute_engine under ~/.ssh
  • We can use this key to connect to the servers directly using ssh with the external IP
  • For the first server, set the external IP as static
    • Go to Instance -> expand menu -> View Network Details
    • It will take us to VPC network
    • Click on External IP addresses
    • Change type for the first server (bigdataserver-1) from Ephemeral to Static
    • Provide Name and Description ( bigdataserver-1 )
    • Click on Reserve
    • There is a nominal cost associated with the static IP
  • Make sure to copy the private key onto the first server, from which you will manage all the other servers using meaningful names such as bigdataserver-2.
    • scp -i ~/.ssh/google_compute_engine ~/.ssh/google_compute_engine dgadiraju@
    • Once the private key is copied, we should be able to connect from bigdataserver-1 using a command like this – ssh -i ~/.ssh/google_compute_engine bigdataserver-2
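The connection steps above can be summarised as a handful of commands. The sketch below prints them for review. The user name, the external IP and the zone are placeholders, and reserving the static address with `gcloud compute addresses create` is an alternative to the console steps rather than something this section prescribes.

```shell
# Print the commands used to connect to and manage the first server.
# Arguments (all placeholders): $1 = user name, $2 = external IP of
# bigdataserver-1, $3 = zone. The region is derived from the zone.
connect_cmds() {
  user="$1"; ip="$2"; zone="$3"
  region="${zone%-*}"                   # e.g. us-east1-b -> us-east1
  key='~/.ssh/google_compute_engine'    # key generated by gcloud

  # First connection through gcloud; this is what generates the key.
  echo "gcloud compute ssh bigdataserver-1 --zone=$zone"
  # Promote the ephemeral external IP to a static one.
  echo "gcloud compute addresses create bigdataserver-1 --addresses=$ip --region=$region"
  # Copy the private key to the first server.
  echo "scp -i $key $key $user@$ip:~/.ssh/"
  # This last command is meant to be run on bigdataserver-1 itself.
  echo "ssh -i $key bigdataserver-2"
}

connect_cmds "USER_NAME" "EXTERNAL_IP" "us-east1-b"
```

The first three commands run in the local Ubuntu shell and the last one on bigdataserver-1, so they are best run by hand rather than piped into a shell.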

Setup ansible on first server

We need to connect to all the instances in one shot to perform several tasks while setting up and managing the cluster. Ansible lets us connect to multiple instances and run common tasks on all the nodes.
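As an illustration of what this could look like, the sketch below generates a minimal Ansible inventory for the 8 servers. The group name `bigdata` and the file name `hosts` are hypothetical, and installing Ansible from EPEL is an assumption about the CentOS 7 setup rather than a step given here.

```shell
# On CentOS 7, Ansible is commonly installed from EPEL (assumption):
#   sudo yum install -y epel-release && sudo yum install -y ansible

# Print a minimal inventory listing the 8 servers by name.
# The group name "bigdata" is hypothetical.
inventory() {
  echo "[bigdata]"
  i=1
  while [ "$i" -le 8 ]; do
    echo "bigdataserver-$i"
    i=$((i + 1))
  done
}

inventory   # redirect to a file to use it, e.g. inventory > hosts
```

With the inventory saved as `hosts`, a command such as `ansible all -i hosts -m ping --private-key ~/.ssh/google_compute_engine` should confirm that every node is reachable from the first server.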

Format JBOD

Now let us format the additional disks we added to each of the hosts.
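A minimal sketch of formatting and mounting one disk follows. The device name `/dev/sdb`, the ext4 file system and the mount point `/data1` are all assumptions, so check the actual device with `lsblk` first. Formatting destroys any data on the device, which is why the commands are only printed.

```shell
# Print the commands that format one additional disk and mount it.
# $1 = block device (assumed /dev/sdb), $2 = mount point (hypothetical).
format_cmds() {
  dev="$1"; mnt="$2"
  echo "sudo mkfs.ext4 -F $dev"    # destructive: creates a new file system
  echo "sudo mkdir -p $mnt"
  echo "sudo mount $dev $mnt"
}

format_cmds /dev/sdb /data1
```

The mount does not survive a reboot unless an entry is also added to /etc/fstab.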

Cluster Topology

Let us look at the cluster topology of a typical production cluster, and at the servers we are going to use to learn the setup process.

A typical production cluster will contain hundreds of nodes, out of which:

  • A few servers will host databases that store data for different services
    • Hive
    • Oozie
    • Cloudera Management Service
    • and more
  • 1 or 2 servers to run management tools based on distribution.
    • Cloudera – Cloudera Manager and Cloudera Management Service
    • Hortonworks – Ambari and Ambari Metrics
    • MapR
    • and more
  • 2 or 3 will be categorized as Gateway Nodes
  • A handful of Master Nodes run the master processes. On smaller clusters we might deploy multiple master processes on the same node.
    • 3 servers for Zookeeper
    • 2 servers for Namenode and Secondary Namenode or Active and Passive Namenode
    • 2 servers for Resource Manager and associated history servers
    • 1 server for Hive
    • 1 or 2 servers for Impala
    • 3 servers for HBase
    • and more
  • The rest of the nodes are categorized as worker nodes, where we deploy the worker (slave) processes of all the services.
    • HDFS – Datanodes
    • YARN – Node Manager
    • HBase – Region Servers
    • Impala – Impalad
    • and more

To learn, we can start with 8 servers.

  • One server for Gateway and other Management services based on distribution
    • MySQL Server with multiple databases
    • Cloudera – Cloudera Manager and Cloudera Management Service Components
    • Hortonworks – Ambari and Ambari Metrics
    • or other distribution management tools
  • 3 Masters
    • Namenode and Secondary Namenode or Active and Passive Namenode
    • Resource Manager on 2 nodes and associated history servers
    • Zookeeper on all 3 masters
    • HBase Master on all 3 masters
    • and more
  • 3 + 1 Workers (we will start with 3 and add one later)
    • HDFS – Datanodes
    • YARN – Node Manager
    • HBase – Region Servers
    • and more
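To make the 8-server layout concrete, the sketch below maps each server to a role. Which server number gets which role is purely an assumption for illustration; only the counts (1 gateway, 3 masters, 3 + 1 workers) come from the list above.

```shell
# Map a server number to its role in the learning cluster.
# The numbering is hypothetical; only the counts follow the list above.
role_of() {
  case "$1" in
    1)       echo "gateway" ;;   # gateway and management services
    2|3|4)   echo "master" ;;    # 3 masters
    5|6|7|8) echo "worker" ;;    # 3 workers, plus 1 added later
    *)       echo "unknown" ;;
  esac
}

i=1
while [ "$i" -le 8 ]; do
  echo "bigdataserver-$i: $(role_of "$i")"
  i=$((i + 1))
done
```

Keeping the mapping in one place makes it easier to turn into Ansible inventory groups later on.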

By now, you should have signed up for a Google Cloud account, understood the GCP concepts, created an instance template, provisioned 8 servers, set up Ansible on the first server, and formatted and mounted the additional storage. You should also have set up the mobile app to monitor the usage of your credits and to manage the servers.