In this section we will go through the most important HDFS commands.
- Getting list of commands and help
- Creating directories and changing ownership
- Managing files and file permissions
- Controlling access using ACLs
- Overriding Properties
- HDFS usage and Metadata Commands
- Creating Snapshots
- Using CLI for administration
We are deliberately keeping command snippets to a minimum for most of the topics. We recommend watching the video and practicing each command with the help of its usage output.
Before practicing the commands as demonstrated, start the servers in the cloud environment and then start the services using Cloudera Manager.
Getting list of commands and help
Let us explore how to list the commands and get the help or usage for a given command.
- Even though we can run commands from almost all the nodes in the cluster, we should only use the Gateway node to run HDFS commands.
- First we need to make sure that the designated Gateway server is a Gateway for the HDFS service, so that we can run commands from it. In our case we have designated bigdataserver-1 as Gateway.
- Let us make sure that bigdataserver-1 is added as HDFS Gateway so that we can run our commands successfully.
- We can also run commands against multiple clusters. However, one server cannot be configured as Gateway for more than one cluster, so for the other clusters we have to specify the Namenode URI using -fs. We can get the Namenode URI from core-site.xml or Cloudera Manager.
- Typically the Namenode process listens on port 8020.
- hadoop fs – lists all the commands available
- hadoop fs -usage – gives us the basic usage for a given command
- hadoop fs -help – gives us additional information for all the commands
- We can run help on individual commands as well.
- Let us also review a very important command, hadoop fs -ls, which lists files and directories under a given path.
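The commands above can be sketched as a small script. This is only an illustration under stated assumptions: the Namenode host namenode.example.com is a hypothetical placeholder (take the real URI from core-site.xml or Cloudera Manager), and the script prints each command instead of running it unless RUN is set to an empty value.

```shell
# Dry run by default: each command is printed instead of executed.
# On the Gateway node, run the script with RUN= (empty) to execute for real.
RUN="${RUN-echo}"

# Hypothetical Namenode URI; 8020 is the typical Namenode port.
NAMENODE_URI="${NAMENODE_URI:-hdfs://namenode.example.com:8020}"

show_help() {
  $RUN hadoop fs                 # list all available subcommands
  $RUN hadoop fs -usage "$1"     # basic usage for the given subcommand
  $RUN hadoop fs -help "$1"      # detailed help for the given subcommand
}

list_path() {
  # -fs is a generic option, so it goes before the subcommand.
  $RUN hadoop fs -fs "$NAMENODE_URI" -ls "$1"
}

show_help ls
list_path /user
```

The -fs option is only needed when the target cluster is not the one the Gateway is configured for; otherwise hadoop fs -ls /user is enough.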
Creating Directories and Changing Ownership
Now let us have a look at how to create directories and manage ownership.
- By default, hdfs is the superuser of HDFS.
- hadoop fs -mkdir – to create directories
- hadoop fs -chown – to change ownership of files
- chown can also be used to change the group. We can change the group using -chgrp command as well. Make sure to run -help on chgrp and check the details.
- Creating user space
- Create directory with user id cloudera under /user
- Change ownership to the same name as the directory created earlier ( /user/cloudera )
- You can validate permissions by using hadoop fs -ls command on /user
- Let us create OS users on bigdataserver-1 and then user spaces for cloudera, itversity and demo.
- We will be using these to demonstrate ACLs.
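The user space steps above could look roughly like this. It is a sketch, not the exact procedure from the video: it assumes sudo access on the Gateway node, assumes each user's group has the same name as the user, and only prints the commands unless RUN is set to an empty value.

```shell
# Dry run by default: commands are printed, not executed.
# Run with RUN= (empty) on the Gateway node to execute them.
RUN="${RUN-echo}"

create_user_space() {
  user="$1"
  # Create the OS user on the Gateway node (assumes sudo access).
  $RUN sudo useradd "$user"
  # Create the HDFS user space as the hdfs superuser, then hand it over.
  # Assumption: the group name is the same as the user name.
  $RUN sudo -u hdfs hadoop fs -mkdir -p "/user/$user"
  $RUN sudo -u hdfs hadoop fs -chown "$user:$user" "/user/$user"
}

for u in cloudera itversity demo; do
  create_user_space "$u"
done

# Validate owner, group and permissions of the new directories.
$RUN hadoop fs -ls /user
```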
Managing Files and File Permissions
Now let us get into commands related to managing files in HDFS. It includes deleting files, copying files as well as HDFS File Permissions.
Deleting Files from HDFS
Let us see how we can delete files from HDFS.
- As we have already copied data into HDFS, let us start with deleting files using the hadoop fs -rm command.
- When we use the rm command, files are moved (not copied) to the .Trash directory by default. It acts as a recycle bin that protects against accidental deletes.
- We can use -skipTrash to bypass the recycle bin and delete data permanently. However, this cannot be undone.
- .Trash can be cleaned up manually by users belonging to the superuser group, such as hdfs, or automatically based on the trash related properties defined in core-site.xml.
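A minimal sketch of the delete options described above. The path /user/cloudera/orders_old is hypothetical, and the commands are only printed unless RUN is set to an empty value.

```shell
# Dry run by default: commands are printed, not executed.
RUN="${RUN-echo}"

delete_to_trash() {
  # With trash enabled, the data is moved to the user's .Trash directory.
  $RUN hadoop fs -rm -r "$1"
}

delete_permanently() {
  # Bypasses the trash; the data cannot be recovered afterwards.
  $RUN hadoop fs -rm -r -skipTrash "$1"
}

expunge_trash() {
  # Creates a new trash checkpoint and removes checkpoints older than
  # the retention interval configured in core-site.xml.
  $RUN hadoop fs -expunge
}

delete_to_trash /user/cloudera/orders_old   # hypothetical path
expunge_trash
```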
Copying Files between local file system and HDFS
We can copy files from the local file system to HDFS and vice versa. We can also append data to existing files in HDFS.
- hadoop fs -copyFromLocal or hadoop fs -put – to copy files from the local file system to HDFS. The process of copying data is already covered: the file is divided into blocks, which are stored on Datanodes in distributed fashion based on the block size and replication factor.
- hadoop fs -copyToLocal or hadoop fs -get – to copy files from HDFS to the local file system. It reads all the blocks in sequence using the index and reconstructs the file in the local file system.
- We can also use hadoop fs -appendToFile to append data to an existing file.
- However, we cannot update or fix data in files while they are in HDFS. If we have to fix any data, we have to copy the file to the local file system, fix the data and then copy it back to HDFS.
- We can move files from the local file system to HDFS using hadoop fs -moveFromLocal. Even though there is a moveToLocal command, its functionality is not implemented yet.
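The copy commands above, including the "copy out, fix, copy back" workflow, might be sketched as follows. All paths are hypothetical examples, and nothing is executed unless RUN is set to an empty value.

```shell
# Dry run by default: commands are printed, not executed.
RUN="${RUN-echo}"

upload()   { $RUN hadoop fs -put "$1" "$2"; }            # same role as -copyFromLocal
download() { $RUN hadoop fs -get "$1" "$2"; }            # same role as -copyToLocal
append()   { $RUN hadoop fs -appendToFile "$1" "$2"; }   # local file appended to HDFS file
move_in()  { $RUN hadoop fs -moveFromLocal "$1" "$2"; }  # local copy is removed afterwards

fix_file() {
  hdfs_file="$1"
  local_copy="$2"
  # Files cannot be edited in place in HDFS: bring a copy to local first.
  $RUN hadoop fs -get "$hdfs_file" "$local_copy"
  # ... fix the local copy here with any local tool ...
  # -f overwrites the existing HDFS file with the fixed copy.
  $RUN hadoop fs -put -f "$local_copy" "$hdfs_file"
}

upload /tmp/orders.csv /user/cloudera/orders
fix_file /user/cloudera/orders/orders.csv /tmp/orders_fix.csv
```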
Copying or Moving Files within HDFS
We can also copy and move files within HDFS using commands like cp and mv.
- hadoop fs -cp – to copy files from one HDFS location to another HDFS location
- hadoop fs -mv – to move files from one HDFS location to another HDFS location
- mv is faster than cp, as mv deals only with metadata whereas cp has to copy all the blocks.
- If you have to rename or move files, make sure to use hadoop fs -mv.
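As a short illustration of the difference, with hypothetical paths and commands that are only printed unless RUN is set to an empty value:

```shell
# Dry run by default: commands are printed, not executed.
RUN="${RUN-echo}"

# cp reads and rewrites every block of the source.
copy_in_hdfs()   { $RUN hadoop fs -cp "$1" "$2"; }
# mv only updates metadata on the Namenode, so it also serves as rename.
rename_in_hdfs() { $RUN hadoop fs -mv "$1" "$2"; }

copy_in_hdfs /user/cloudera/orders /user/cloudera/orders_backup
rename_in_hdfs /user/cloudera/orders_backup /user/cloudera/orders_2019
```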
Let us see how we can preview the data in HDFS.
- If we are dealing with files that contain text data (files in text file format), we can preview their contents using commands such as -tail and -cat.
- -tail can be used to preview the last 1 KB of the file.
- -cat can be used to print the whole contents of the file on the screen. Be careful while using -cat, as it can take a while even for medium-sized files.
- If you want to get the first few lines of a file, you can pipe the output of hadoop fs -cat to the Linux more command.
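The preview commands can be sketched as below. Note one swap: head is used here in place of more so that the example stays non-interactive. The file path is hypothetical, and the hadoop commands are only printed unless RUN is set to an empty value.

```shell
# Dry run by default: the hadoop command is printed, not executed.
RUN="${RUN-echo}"

# Last 1 KB of the file.
preview_tail() { $RUN hadoop fs -tail "$1"; }

# First N lines (default 10); head stops reading once it has enough lines.
preview_head() { $RUN hadoop fs -cat "$1" | head -n "${2:-10}"; }

preview_tail /user/cloudera/orders/orders.csv
preview_head /user/cloudera/orders/orders.csv 5
```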
HDFS File Permissions
Let us go through file permissions in HDFS.
- As we create the files, we can check the permissions on them using -ls command.
- Typically the owner of the user space has rwx, while members of the specified group as well as others have r-x.
- We can change the permissions using hadoop fs -chmod.
- We can specify the permissions in symbolic mode (e.g. +x to grant execute access to owner, group as well as others) or in octal mode (e.g. 755 to grant rwx to the owner and r-x to group and others).
If you are not familiar with the Linux chmod command, we highly recommend spending some time to understand it in detail, as it is very important with respect to file permissions.
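Both modes can be tried with a sketch like the following. The directories under /user/cloudera are hypothetical, and the commands are only printed unless RUN is set to an empty value.

```shell
# Dry run by default: commands are printed, not executed.
RUN="${RUN-echo}"

set_mode()           { $RUN hadoop fs -chmod "$1" "$2"; }
set_mode_recursive() { $RUN hadoop fs -chmod -R "$1" "$2"; }

set_mode 755 /user/cloudera                       # octal: rwx owner, r-x group and others
set_mode g+w /user/cloudera/orders                # symbolic: add write for the group
set_mode_recursive o-rwx /user/cloudera/private   # remove all access for others

# Check the result in the first column of the listing.
$RUN hadoop fs -ls /user/cloudera
```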
Let us copy data into all 3 user spaces for the users cloudera, itversity and demo.
Reminder: If you want to take a break, make sure to stop Services and Cloudera Management Service in Cloudera Manager and then stop the servers in the Cloud Platform.
Controlling Access using ACLs
ACL stands for Access Control List. ACLs give finer-grained access control over files. Without ACLs, permissions are controlled at the owner, group and others levels only.
To use ACLs in HDFS, we need to set dfs.namenode.acls.enabled to true as part of hdfs-site.xml.
We can use hadoop fs -setfacl to set an ACL at the file or directory level, and hadoop fs -getfacl to get details about the ACL on a file or a directory.
First, let us see examples at the file level, then directory level and then deleting ACLs.
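That sequence (file level, directory level, then deleting ACLs) might look like this. The users come from the user spaces created earlier, the paths are hypothetical, and the commands are only printed unless RUN is set to an empty value.

```shell
# Dry run by default: commands are printed, not executed.
RUN="${RUN-echo}"

# Grant a named user the given permissions (e.g. r-- or r-x) on a path.
grant_user()      { $RUN hadoop fs -setfacl -m "user:$1:$2" "$3"; }
# Default ACL on a directory: inherited by files created in it later.
grant_default()   { $RUN hadoop fs -setfacl -m "default:user:$1:$2" "$3"; }
show_acl()        { $RUN hadoop fs -getfacl "$1"; }
# Remove the entry for one user, or strip all ACL entries from the path.
revoke_user()     { $RUN hadoop fs -setfacl -x "user:$1" "$2"; }
remove_all_acls() { $RUN hadoop fs -setfacl -b "$1"; }

grant_user itversity r-- /user/cloudera/orders/orders.csv   # file level
grant_default demo r-x /user/cloudera/orders                # directory level
show_acl /user/cloudera/orders
revoke_user itversity /user/cloudera/orders/orders.csv      # delete one entry
remove_all_acls /user/cloudera/orders                       # delete all entries
```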
Overriding Properties
Let us see how we can override properties while running commands such as hadoop fs. Let us first review properties files such as core-site.xml and hdfs-site.xml.
- We can override any non-final property using -Dproperty_name=property_value as part of hadoop fs command.
- We can also use options such as -fs to override Namenode URI
- We can also change replication factor using subcommand -setrep
- Some of the properties might have been defined as final as part of the properties files such as core-site.xml or hdfs-site.xml. -D will not have any impact in that case.
- When it comes to Cloudera Manager, we are not supposed to override properties by updating files – instead, we need to use something called Safety Valve that comes as part of Cloudera Manager.
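A sketch of overriding a property for a single command and of changing the replication factor afterwards. dfs.replication is used as the example property, the paths are hypothetical, and the commands are only printed unless RUN is set to an empty value.

```shell
# Dry run by default: commands are printed, not executed.
RUN="${RUN-echo}"

put_with_replication() {
  # -D is a generic option, so it goes before the subcommand.
  # It has no effect if the property is marked final in the site files.
  $RUN hadoop fs -Ddfs.replication="$1" -put "$2" "$3"
}

set_replication() {
  # Changes the replication factor of files already in HDFS;
  # -w waits until the new replication is reached.
  $RUN hadoop fs -setrep -w "$1" "$2"
}

put_with_replication 2 /tmp/orders.csv /user/cloudera/orders
set_replication 3 /user/cloudera/orders/orders.csv
```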
HDFS usage commands and getting metadata
Now let us have a look at HDFS usage commands and also commands used to get the metadata.
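- hadoop fs -df – to get details about the capacity, used space and available space of HDFS. Use -h to get the information in human-readable format.
- hadoop fs -du – to get the size of the files and directories under a given path. Use -s to get summarized information and -h to get sizes in human-readable format.
- hdfs fsck – to get metadata (such as files, blocks and block locations) for a given directory or files.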
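The three commands can be tried as follows. The path /user/cloudera is just an example, and the commands are only printed unless RUN is set to an empty value.

```shell
# Dry run by default: commands are printed, not executed.
RUN="${RUN-echo}"

# Capacity, used and available space of the file system.
fs_capacity()  { $RUN hadoop fs -df -h; }
# Total size of everything under the given path.
dir_size()     { $RUN hadoop fs -du -s -h "$1"; }
# Files, their blocks and the Datanodes holding each block.
block_report() { $RUN hdfs fsck "$1" -files -blocks -locations; }

fs_capacity
dir_size /user/cloudera
block_report /user/cloudera
```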
Creating Snapshots
HDFS Snapshots are read-only point-in-time copies of the file system. Snapshots can be taken on a subtree of the file system or the entire file system. Some common use cases of snapshots are data backup, protection against user errors and disaster recovery.
- It does not copy actual data. It will keep track of changes to metadata.
- First, we need to make the directory snapshottable using hdfs dfsadmin -allowSnapshot. Only users in supergroup can allow snapshots on a directory.
- Once snapshots are allowed, we can create a snapshot using hadoop fs -createSnapshot.
- We can also delete or rename a snapshot using -deleteSnapshot or -renameSnapshot.
- Users in supergroup can also disallow snapshots (using hdfs dfsadmin -disallowSnapshot).
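A full snapshot life cycle might be sketched like this. The directory and snapshot names are hypothetical, the admin steps assume sudo access to the hdfs user, and the commands are only printed unless RUN is set to an empty value.

```shell
# Dry run by default: commands are printed, not executed.
RUN="${RUN-echo}"

snapshot_lifecycle() {
  dir="$1"; name="$2"; new_name="$3"
  # Admin step: make the directory snapshottable.
  $RUN sudo -u hdfs hdfs dfsadmin -allowSnapshot "$dir"
  $RUN hadoop fs -createSnapshot "$dir" "$name"
  # Snapshots are exposed read-only under the .snapshot directory.
  $RUN hadoop fs -ls "$dir/.snapshot"
  $RUN hadoop fs -renameSnapshot "$dir" "$name" "$new_name"
  $RUN hadoop fs -deleteSnapshot "$dir" "$new_name"
  # Admin step: only succeeds once all snapshots of the directory are deleted.
  $RUN sudo -u hdfs hdfs dfsadmin -disallowSnapshot "$dir"
}

snapshot_lifecycle /user/cloudera/orders before_cleanup baseline
```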
Using CLI for administration
There are several commands to perform administration using the CLI. We need to run them as the HDFS superuser to manage the HDFS cluster. In our case it is hdfs itself.
- Formatting Namenode
- Rolling Edits
- Save Namespace (create fsimage)
- Enter or Leave Safemode
- Running balancer
- Running file system utility (fsck)
- and many more
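A few of these tasks can be sketched as below. Formatting the Namenode is deliberately left out because it wipes the file system metadata. The sketch assumes sudo access to the hdfs user, uses 10 as an example balancer threshold, and only prints the commands unless RUN is set to an empty value.

```shell
# Dry run by default: commands are printed, not executed.
RUN="${RUN-echo}"

# Run an admin command as the hdfs superuser.
as_hdfs() { $RUN sudo -u hdfs "$@"; }

save_namespace() {
  # saveNamespace (write a new fsimage) requires safe mode.
  as_hdfs hdfs dfsadmin -safemode enter
  as_hdfs hdfs dfsadmin -saveNamespace
  as_hdfs hdfs dfsadmin -safemode leave
}

as_hdfs hdfs dfsadmin -safemode get   # check the current safe mode state
as_hdfs hdfs dfsadmin -rollEdits      # roll the edit log
save_namespace
as_hdfs hdfs balancer -threshold 10   # example threshold, in percent
as_hdfs hdfs fsck /                   # file system health check
```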
By this time you should have set up Cloudera Manager, installed Cloudera Distribution of Hadoop and configured services such as Zookeeper and HDFS. You should also have a good grasp of the HDFS commands.
Make sure to stop the services in Cloudera Manager and also shut down the servers provisioned from GCP or AWS to conserve credits or control costs.