Configure the HDFS and Understand Concepts


As part of this section we will see how to set up HDFS components such as the Namenode, Secondary Namenode, and Datanodes, while exploring some of the key concepts of this very important service.

  • Setup HDFS
  • Copy Data into HDFS
  • Components of HDFS
  • Configuration Files and Important Properties
  • Review Web UIs and log files
  • Checkpointing and Namenode Recovery
  • Configure Rack Awareness

Cluster Topology

We are setting up the cluster on 7+1 nodes. We start with 7 nodes and will add one more node later.

  • Gateway(s) and Management Service
    • bigdataserver-1
  • Masters
    • bigdataserver-2 – Zookeeper, Namenode
    • bigdataserver-3 – Zookeeper, Secondary Namenode
    • bigdataserver-4 – Zookeeper
  • Slaves or Worker Nodes
    • bigdataserver-5 – Datanode
    • bigdataserver-6 – Datanode
    • bigdataserver-7 – Datanode
  • We will create a host group hdfs so that we can run commands using Ansible on all nodes where HDFS is running (a sample inventory and ad-hoc commands are sketched below).
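
As a minimal sketch of that host group (the group name hdfs and the inventory file path are assumptions for illustration, not from the original setup), an Ansible inventory entry and a couple of ad-hoc commands could look like this:

    # /etc/ansible/hosts (or any inventory passed with -i) - hypothetical inventory
    [hdfs]
    bigdataserver-2
    bigdataserver-3
    bigdataserver-5
    bigdataserver-6
    bigdataserver-7

    # Verify connectivity and run a quick check on all HDFS nodes
    ansible hdfs -i /etc/ansible/hosts -m ping
    ansible hdfs -i /etc/ansible/hosts -a "df -h"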

Learning Process

We will follow the same standard learning process while adding any software-based service.

  • Downloading and Installing – already taken care of as part of adding hosts to the cluster.
  • Configuration – we need to understand architecture and plan for the configuration.
    • Architecture – Master, Helper, and Slaves
    • Components – Namenode, Secondary Namenode, and Datanodes
    • Configuration Files – /etc/hadoop/conf
    • With Cloudera the location is a bit different; we will see it after setting up the service.
  • Service Logs – /var/log/hadoop-hdfs
  • Service Data – Different locations for different components of HDFS, controlled by properties such as dfs.data.dir, dfs.name.dir, etc.

Setup HDFS

Here are the steps involved in setting up HDFS using Cloudera Manager.

  • Choose the dropdown next to the cluster Cluster 1
  • Click on Add Service
  • Choose HDFS
  • Configure Namenode – bigdataserver-2
  • Configure Secondary Namenode – bigdataserver-3
  • Configure Data Nodes – bigdataserver-5 to bigdataserver-7
  • Once HDFS is started, click on Instances and choose Add Role Instances
  • Configure bigdataserver-1 as the gateway node.
  • Gateways allow us to connect to the HDFS cluster and issue commands to manage files in HDFS.
  • Let us also review a few things as part of the setup process.

https://gist.githubusercontent.com/dgadiraju/4d2dd360d5853f2ed1222e5b6a9f7e83/raw/8d0aea1f73a3a8806e9c40e3ffc863799f259966/cdh-admin-review-hdfs-setup.sh
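
The gist above walks through the full review; as a quick sanity check (assuming SSH access to the nodes and that the JDK's jps is on the PATH), we might simply confirm that the daemons are up:

    # On bigdataserver-2 - the Namenode process should be running
    sudo jps | grep -i namenode
    # On bigdataserver-5 to bigdataserver-7 - a Datanode process should be running on each
    sudo jps | grep -i datanode
    # From the gateway (bigdataserver-1) - basic HDFS health report
    sudo -u hdfs hdfs dfsadmin -report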

Copy Data into HDFS

Let us copy data into HDFS and understand details about how data is stored in HDFS.

  • hadoop fs is the main command we use to copy and manage files in HDFS.

https://gist.githubusercontent.com/dgadiraju/3ef7a526ff1b047f5955ff61bfd9c38c/raw/e35b5ae68bbb2d2575e4ff1a11ee353fe7f952fe/cdh-admin-setup-data.sh

  • A file will be divided into blocks (by default 128 MB) and those blocks will be physically stored on the servers where the Datanode process is running.
  • For example, a 1 GB file will be divided into 8 blocks of 128 MB each, while a 200 MB file will be divided into 2 blocks of 128 MB and 72 MB respectively.
  • If the file size is less than the block size, the file will be stored in a single block of its own size.
  • There will be multiple copies of each block for fault tolerance. This is controlled by the replication factor (the fsck sketch after this list shows blocks and their replicas).
  • Metadata – details about files and blocks.
    • There is block metadata and file metadata
    • At block level we will have mapping between file, block id, block location etc
    • At file level we will have file path, file permissions etc
    • Metadata is managed in memory on the server where the Namenode is running. This memory is also called the Namenode heap (bigdataserver-2 in our case).
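
To see block division and metadata in practice, we can run hdfs fsck against a file we copied in the previous step. This is a minimal sketch; the path /user/itversity/data/crime_data.csv is a hypothetical example, not necessarily the file used in the gist.

    # Show how the file is split into blocks, where each block is stored, and its replication
    hdfs fsck /user/itversity/data/crime_data.csv -files -blocks -locations

The output lists each block id, its size, and the Datanodes holding its replicas, which maps directly to the block and file metadata described above.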

Components of HDFS

Now let us explore different components related to HDFS and how they are used to store files in HDFS.

  • We need to configure the Namenode, Secondary Namenode, Datanodes, balancer, etc. as part of HDFS. Here the Namenode acts as the master and the Secondary Namenode as a helper for the Namenode, whereas the Datanodes act as slaves.
  • Actual data will be stored on the Datanodes in the form of blocks, while metadata is stored in the Namenode’s heap (memory).
  • Click here for a nice article which explains the directory structure of the Datanode data directories.
  • Here is a pictorial representation of how HDFS stores, reads, and writes files.

  • Anatomy of File Write in HDFS

  • Anatomy of File Read in HDFS

  • HDFS uses checksums to ensure data integrity of the files. The client computes a checksum for each block and matches it against the checksum returned by the Datanodes after the blocks are copied onto them.
  • Changes to metadata are logged into edit logs, and a periodic snapshot known as the fsimage is built by merging the edit logs into the previous fsimage.
  • The edit logs are written under the directory defined by dfs.name.dir. If there were no edit logs, we would not be able to recover metadata after a Namenode crash, as memory is transient.
  • These edit logs are periodically merged into the fsimage so that Namenode metadata can be recovered faster.
  • This process of merging the edit logs into the last fsimage and creating a new fsimage is called checkpointing (the Secondary Namenode takes care of checkpointing).
  • We will get into the details of edit logs and fsimage, and how they are used for recovering metadata, at a later point in time.
  • Datanodes send heartbeats to the Namenode at regular intervals. As part of the heartbeat, information about available storage is also sent to the Namenode. In addition, a block report from each Datanode is sent to the Namenode periodically.
  • The balancer is a service which keeps track of the Datanodes and redistributes data so that it is spread appropriately across all the nodes (see the commands sketched after this list).
  • Namenode Web Interface – http://<namenode-host>:50070
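
A few commands tie these components together. This is a sketch using standard HDFS tooling; the file path is hypothetical.

    # Storage and block information reported by each Datanode to the Namenode
    sudo -u hdfs hdfs dfsadmin -report

    # Checksum of a file as computed across its blocks
    hadoop fs -checksum /user/itversity/data/crime_data.csv

    # Run the balancer; -threshold is the allowed deviation (in %) of disk usage across Datanodes
    sudo -u hdfs hdfs balancer -threshold 10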

Features of HDFS

  • Fault tolerant – HDFS uses mirroring at the block level, and dfs.replication controls how many copies are made (see the example after this list).
    • Traditionally we use RAID for fault tolerance against hard drive failures in network storage.
    • In Hadoop, replication takes care of not only hard drive failures but also any other hardware failure that might result in a server crash, as well as planned maintenance.
    • By default, if there are m nodes and n is the replication factor, where m > n, the cluster can survive up to n-1 node failures.
  • Logical Distributed File System – Files are divided into blocks based upon dfs.blocksize and stored on the servers designated as Datanodes.
  • Rack Awareness
    • The replication factor can cover the failure of up to n-1 nodes at the individual server level.
    • We use redundant networking to cover network cable related issues.
    • However, if there is a network switch failure or a complete failure of the rack where the servers are hosted, then there will be an outage at the cluster level.
    • To overcome this we can configure a rack awareness script. For this, servers need to be placed on multiple racks behind at least 2 network switches.
    • We need to come up with a strategy, and then a rack awareness script, such that at least one copy of each block is placed in each of the racks in a multi-rack Hadoop cluster behind multiple network switches (typically 2 or more).
    • We will see how to configure rack awareness script at a later point in time.
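
To see replication in action on our 3-Datanode cluster, we can change the replication factor of a file and confirm it with fsck. A minimal sketch; the path is hypothetical.

    # Reduce the replication factor of a file to 2 and wait until the blocks are adjusted
    hadoop fs -setrep -w 2 /user/itversity/data/crime_data.csv

    # Confirm the new replication factor and replica placement
    hdfs fsck /user/itversity/data/crime_data.csv -files -blocks -locations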

Configuration files and Important Properties

Unlike the plain vanilla distribution and other vendor distributions, Cloudera manages configuration files a bit differently. Typically configuration files will be in /etc/hadoop/conf, but when it comes to Cloudera, /etc/hadoop/conf will only have templates. The actual properties files are managed under /var/run/cloudera-scm-agent/process on each node.

  • hadoop-env.sh is for environment variables
    • HADOOP_HOME
    • JAVA_HOME
    • HADOOP_HEAPSIZE – default heap size for Hadoop components such as Namenode, Secondary Namenode, Datanode etc.
    • HADOOP_DATANODE_OPTS – JVM settings for Datanode. We can override memory settings for Datanode here.
    • HADOOP_NAMENODE_INIT_HEAPSIZE – Overrides the Namenode heap size. We can also use HADOOP_NAMENODE_OPTS for the Namenode (and similarly for the Secondary Namenode).
    • As highlighted earlier, each file as well as each block will have metadata associated with it, and each such entry takes about 150 bytes of Namenode heap. Replication was not included in that count; with replication, each block takes 150 bytes times the replication factor. For example, a file stored as 8 blocks with replication factor 3 would account for roughly (1 + 8 × 3) × 150 bytes of heap.
    • You can actually see more details from this Cloudera article about Namenode heap sizing.
  • core-site.xml will have parameters that are used by HDFS and MapReduce
    • fs.defaultFS
    • fs.trash.interval
    • proxy user configuration
    • io.compression.codecs
    • net.topology.script.file.name
  • hdfs-site.xml will have parameters that are used by HDFS
    • dfs.blocksize
    • dfs.replication
    • dfs.client.read.shortcircuit
    • dfs.namenode.http-address
    • dfs.datanode.http.address
    • dfs.datanode.data.dir

We can have the HTTP address bound to 0.0.0.0 if we want to access the URL using any of the IP addresses assigned to the server.
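
Rather than opening the generated XML files under /var/run/cloudera-scm-agent/process, we can ask the deployed client configuration for the effective values. A small sketch, run from the gateway node:

    # Effective values as seen by clients using /etc/hadoop/conf
    hdfs getconf -confKey fs.defaultFS
    hdfs getconf -confKey dfs.blocksize
    hdfs getconf -confKey dfs.replication

    # List the configured Namenode(s) and Secondary Namenode(s)
    hdfs getconf -namenodes
    hdfs getconf -secondaryNameNodes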

https://gist.githubusercontent.com/dgadiraju/0bc896a2aa7fc7b99a1b6fc52c4dcc4a/raw/35ea5255d6de84875028414674595cb9325c5886/cdh-admin-important-properties.csv

Review Web UIs and log files

There are several Web UIs you need to be aware of with respect to HDFS.

  • Namenode Web UI
  • Datanode Web UI
  • Using Cloudera Manager to troubleshoot issues
  • Log files related to the service

Checkpointing and Namenode Recovery

At times the Namenode might be down due to planned maintenance or unplanned outages. As administrators, it is very important for us to understand the Namenode recovery process.

  • Let us start with Namenode metadata (information about files and blocks).
    • There is metadata related to blocks as well as files.
    • At block level we will have mapping between file, block id, block location etc
    • At file level we will have file path, file permissions etc
    • Metadata is managed as part of memory in the server where Namenode is running. It is also called as Namenode heap (bigdataserver-2 in our case)
    • Changes to metadata are also logged into edit logs, and periodic snapshots of the metadata are used to create the fsimage.
    • dfs.name.dir/dfs.namenode.name.dir and dfs.namenode.checkpoint.dir are the properties which control the location of the fsimage and edit logs on the Namenode and the Secondary Namenode respectively.
  • edit logs – These contain the sequence of changes made to the filesystem by all the client applications since the last checkpoint. As part of block metadata, only files and the associated block ids are captured (not block locations).
  • fsimage – This is a periodic snapshot of the filesystem metadata. At regular intervals a new fsimage is created by merging the edit logs written since the last fsimage into that fsimage.
  • Only a few edit logs and a couple of fsimages are retained; older ones are deleted (the offline viewers sketched below let us inspect them).
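
To peek inside the fsimage and edit logs, Hadoop ships offline viewers that can be run on the Namenode. This is a sketch only: /dfs/nn is assumed as dfs.name.dir (the Cloudera Manager default), and the fsimage/edits file names are hypothetical; the actual names on your cluster will differ.

    # List the fsimage and edit log files under the Namenode metadata directory
    sudo ls -ltr /dfs/nn/current

    # Dump an fsimage to XML with the Offline Image Viewer
    sudo -u hdfs hdfs oiv -p XML -i /dfs/nn/current/fsimage_0000000000000001234 -o /tmp/fsimage.xml

    # Dump an edits file to XML with the Offline Edits Viewer
    sudo -u hdfs hdfs oev -p XML -i /dfs/nn/current/edits_0000000000000001235-0000000000000001300 -o /tmp/edits.xml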

Checkpointing

The process of creating a new fsimage at regular intervals is known as checkpointing.

  • A new fsimage is created on the Namenode when it is formatted and added.
  • Changes to metadata will be logged in the edit logs on the Namenode.
  • The Secondary Namenode takes care of checkpointing. It takes the last fsimage (from the Namenode or the Secondary Namenode) and the latest edit logs from the Namenode since the last checkpoint, and creates a new fsimage.
  • Once checkpointing is done, the fsimage is copied back to the Namenode directory controlled by dfs.name.dir.
  • dfs.namenode.checkpoint.period (Default – 1 hour) – Maximum delay between two consecutive checkpoints
  • dfs.namenode.checkpoint.txns (Default – 1 million) – Defines the number of transactions in the edit logs which will force a checkpoint
  • Whichever of these two events occurs first will trigger checkpointing.
  • You can also perform manual checkpointing (see the sketch after this list)
    • On the Namenode you first need to save the namespace so that a new fsimage is created (in safe mode). We can use hdfs dfsadmin to enter safe mode, create the fsimage, and then leave safe mode.
    • On the Secondary Namenode you can run this command to force a checkpoint – hadoop secondarynamenode -checkpoint force
    • We can also enter safe mode and create the fsimage using Cloudera Manager. Click on HDFS on the Dashboard/Home page and then Namenode on the summary. You will see the options under Actions as highlighted below.
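
A minimal sketch of the manual checkpointing commands mentioned above, run as the hdfs superuser:

    # On the Namenode: enter safe mode, save the namespace (creates a new fsimage), then leave safe mode
    sudo -u hdfs hdfs dfsadmin -safemode enter
    sudo -u hdfs hdfs dfsadmin -saveNamespace
    sudo -u hdfs hdfs dfsadmin -safemode leave

    # On the Secondary Namenode: force a checkpoint
    sudo -u hdfs hadoop secondarynamenode -checkpoint force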

It is good practice to configure the Namenode to store multiple copies of the fsimage and edit logs. These copies should be persisted on different hard drives, and one of the drives should be external to the server. By doing this, even if the Namenode goes down completely, we should be able to recover the metadata onto a different server.

Namenode Recovery Process

Now let us get into the details of the Namenode recovery process. Namenode recovery happens in safe mode.

  • Automated Process
    • Let us assume that the Namenode is restarted (we can do it using the Cloudera Manager web interface).
    • It will start in safe mode. In safe mode, we will not be able to add files to HDFS, but we can read data from some of the files as they are recovered.
    • The Namenode will pick the latest fsimage from dfs.name.dir and restore it into memory. At this point the Namenode has metadata up to the point in time when that fsimage was created.
    • Changes to metadata made since the last checkpoint are available in the edit logs, and hence the Namenode plays those edit logs back.
    • However, the fsimage and edit logs only contain partial metadata such as files and blocks, but not block locations.
    • As the Namenode comes up, Datanodes send block location information to the Namenode.
    • Once the fsimage is restored, the edit logs are played back, and the block location information is updated, the Namenode comes out of safe mode.
  • Manual Process (in case of corruption or missing files or directories on the Namenode) – Log in to the Namenode and run this command as the hdfs user: hadoop namenode -recover (see the sketch below)
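
A sketch of the commands involved, assuming we are logged in to the Namenode host:

    # Check whether the Namenode is still in safe mode during recovery
    sudo -u hdfs hdfs dfsadmin -safemode get

    # Manual recovery when the Namenode metadata is corrupt or files are missing
    sudo -u hdfs hadoop namenode -recover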

Configure Rack Awareness

As highlighted earlier, replication covers server failures and redundant networking covers cable and network card failures. But these do not cover rack maintenance or network switch failures. A rack is nothing but an enclosure in which we place multiple servers.

Also, replication can only cover the failure of n-1 nodes in an m-node cluster, where m > n and n is the replication factor.

In the above-mentioned cases (such as rack maintenance/outage, or n or more nodes being down) the entire cluster might have to be taken down. One way to address this issue is to have nodes in multiple racks and ensure that at least one copy of the data is present in each rack.

To ensure blocks are copied to nodes in more than one rack, we need to make the cluster rack aware by assigning each host to a rack group using a script and a file which has the mapping between node and logical rack group id.

If the cluster is set up with 2 rack groups, we divide the nodes 2:1 (one rack group has twice as many nodes as the other).

We can assign the rack to the nodes in the cluster either using Cloudera Manager or Custom Script.

  • When managed by distributions like Cloudera it is recommended to use Cloudera Manager.
  • In either approach we will have a rack topology script and a file with the mapping between node and logical rack group id.
  • We should not assign rack group ids randomly. All the servers in a physical rack should point to the same rack group id.
  • As we have only 3 Datanodes, let us assume that bigdataserver-5 and bigdataserver-6 are in one rack group and the other node is in the other rack group.

Using Cloudera Manager

  • Go to HDFS and review HDFS Components
  • Click on Hosts and go to the hosts page
  • All the hosts will be shown
  • Choose bigdataserver-5 and bigdataserver-6
  • Go to Actions -> Assign Rack and type /rack1. For production clusters, you need to pass the actual rack details based upon the hardware layout in your enterprise.
  • Repeat the step for bigdataserver-7 and assign it to /rack2.
  • Deploy and Restart the service.

Using Script

If we plan to use an external script file, we can write a shell or Python script and give the script file path in the HDFS configuration for the parameter net.topology.script.file.name.

  • Create rack-topology.sh and rack-topology.data under /etc/hadoop/conf
  • Define ‘net.topology.script.file.name’ in core-site.xml with value /etc/hadoop/conf/rack-topology.sh
  • We need to have the mapping file in the location referred to by the script. The script below uses rack-topology.data in the same directory, hence we need to have the mapping file under /etc/hadoop/conf as well.
  • Restart HDFS for the changes to take effect.

Refer to the sample scripts below, which need to be copied to the Hadoop configuration directory (/etc/hadoop/conf/).

https://gist.githubusercontent.com/dgadiraju/717d674bd7a72e2ca538407f1486652a/raw/17773e2a738aed5d30b605a5883739c650631332/rack-topology-01.sh

https://gist.githubusercontent.com/dgadiraju/717d674bd7a72e2ca538407f1486652a/raw/17773e2a738aed5d30b605a5883739c650631332/rack_topology-02.data
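
Once the script and data file are in place (their content is in the gists above), we can test the mapping locally and then confirm what the Namenode sees. A small sketch; the hostname passed to the script is just an example.

    # Test the topology script directly - it should print the rack group for a given host
    bash /etc/hadoop/conf/rack-topology.sh bigdataserver-5

    # After restarting HDFS, confirm the rack assignment of each Datanode
    sudo -u hdfs hdfs dfsadmin -printTopology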

By this time you should have set up Cloudera Manager, installed the Cloudera Distribution of Hadoop, and configured services such as Zookeeper and HDFS. You should also have understood some of the important concepts (if not all). If there are any issues, please reach out to us.

Make sure to stop the services in Cloudera Manager and also shut down the servers provisioned on GCP or AWS to conserve credits and control costs.
