Secure the Clusters


Enable relevant services and configure the cluster to meet goals defined by security policy; demonstrate knowledge of basic security practices

  • Configure HDFS ACLs
  • Install and configure Sentry
  • Configure Hue user authorization and authentication
  • Enable/configure log and query redaction
  • Create Encrypted Zones in HDFS

Configure HDFS ACLs

ACL stands for Access Control List. ACLs provide finer-grained access control for HDFS files and directories than the standard owner/group/other permission model. The concept is inherited from Linux. We need to understand how to configure HDFS ACLs using Cloudera Manager.

This is covered extensively as part of Install CM and CDH – Important HDFS Commands. There are several topics in that section related to ACLs.
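As a quick refresher, HDFS ACLs are managed with the -setfacl and -getfacl options of the hdfs dfs command. These commands require a running cluster with dfs.namenode.acls.enabled set to true; the path and user below are illustrative examples, not values from this setup.

```shell
# Grant read/execute on a directory to an extra user beyond owner/group/other
hdfs dfs -setfacl -m user:analyst:r-x /user/hive/warehouse/retail_db.db

# Apply a default ACL so newly created children inherit the same entry
hdfs dfs -setfacl -m default:user:analyst:r-x /user/hive/warehouse/retail_db.db

# Review the ACL entries on the directory
hdfs dfs -getfacl /user/hive/warehouse/retail_db.db
```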

Install and Configure Sentry

Let us see the details about how to install and configure Sentry.

  • The Sentry service is an RPC server that stores authorization metadata in an underlying relational database.
  • It provides RPC interfaces to retrieve and manipulate privileges.
  • It can be integrated with Kerberos for security.
  • The service serves authorization metadata from database-backed storage; it does not handle actual privilege validation.
  • The Hive, Impala, and Solr services are clients of this service.
  • Sentry privileges are enforced only when those services are configured to use Sentry.
  • Java must be installed on all client nodes and $JAVA_HOME must be configured.
  • Make sure the cluster is running and Kerberized. For testing purposes without Kerberos, set sentry.hive.testing.mode to true once the Sentry service is added.
    • Cloudera Manager -> Hive -> Configuration -> Sentry Service Advanced Configuration Snippet (Safety Valve) for sentry-site.xml
  • To define a role and grant privileges mapped to a user group, make sure that the user group exists on all the nodes.
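For testing without Kerberos, the safety valve entry described above amounts to the following property, a sketch of what ends up in the generated sentry-site.xml:

```xml
<property>
  <name>sentry.hive.testing.mode</name>
  <value>true</value>
</property>
```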

We can use Cloudera Manager to set up Sentry.

  • Make sure a database is created for Sentry. We have MySQL running on bigdataserver-1; let us create a database named sentry in it.
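The database can be created from the MySQL shell on bigdataserver-1. The user name and password below are illustrative assumptions, not values from the original setup:

```sql
-- Run in the MySQL shell on bigdataserver-1
CREATE DATABASE sentry DEFAULT CHARACTER SET utf8;
CREATE USER 'sentry'@'%' IDENTIFIED BY 'sentry_password';  -- example credentials
GRANT ALL PRIVILEGES ON sentry.* TO 'sentry'@'%';
FLUSH PRIVILEGES;
```

Whatever credentials you choose here are the ones to enter in the Add Database details step of the wizard.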

  • We also need to make sure that mysql-connector-java is installed on the node where we are going to configure the Sentry Server. In our case it is bigdataserver-4. We have already installed it earlier and can validate by running ls -ltr /usr/share/java/mysql-connector-java.jar
  • In case you cannot find the MySQL Connector, you can install it using sudo yum -y install mysql-connector-java
  • Go to Add Service -> Choose Sentry
  • Choose Sentry Server (bigdataserver-4) and Gateway (bigdataserver-1)
  • Add Database details
  • Complete Setup Process

Configure Sentry

We can use Sentry with different high-level services such as Hive, Impala and Hue.

  • Changing the Hive Warehouse permissions
  • Disable Impersonation
  • Make sure system users such as hive and impala can run YARN jobs on the cluster.
  • Block Hive CLI access
  • Enable Sentry in the Hive and Impala
  • Enable Sentry in Hue
  • Add Sentry Admin Group

Enabling the Sentry Service for Hive, Impala and Hue

We will set up Sentry for all 3 services.

Let us see how to configure Hive to use Sentry for authentication and authorization.

  • Typically we give 777 permissions on /user/hive/warehouse and enable impersonation, so that queries run under the identities of the actual users who submit them rather than under the hive user.
  • With Sentry, queries are submitted through HiveServer2 as the hive user and authorization is enforced by Sentry, so we need to disable impersonation.
  • Changing the permissions and ownership for warehouse directory
    sudo -u hdfs hdfs dfs -chmod -R 775 /user/hive/warehouse
    sudo -u hdfs hdfs dfs -chown -R hive:hive /user/hive/warehouse
  • Disable Impersonation – With Sentry, jobs from Hue run in YARN as the hive user instead of the individual user identities. Enabling HiveServer2 impersonation would bypass Sentry in the end-to-end authorization process. Let us see how to disable impersonation for HiveServer2 in the Cloudera Manager Admin Console:
    • Go to the Hive service -> Configuration tab.
    • Select Scope -> HiveServer2 & Category -> Main.
    • Uncheck the HiveServer2 Enable Impersonation checkbox.
    • Click Save Changes to commit the changes.
  • Enable System Users to Submit YARN Jobs – Since we have disabled Hive impersonation, we now need to make sure the hive user is allowed in the YARN configuration to submit jobs.
    • Go to the YARN service -> Configuration tab.
    • Select Scope -> NodeManager & Category -> Security .
    • Ensure the Allowed System Users property includes the hive user. If not, add hive.
    • Click Save Changes to commit the changes.
    • Repeat the above steps for every NodeManager role group of the YARN service that is associated with Hive.
    • Restart the YARN service.
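Under the hood, the Allowed System Users property feeds the allowed.system.users entry in the NodeManager's container-executor.cfg. The fragment below is a sketch; the exact user lists depend on your cluster and CDH version:

```properties
# container-executor.cfg (managed by Cloudera Manager)
allowed.system.users=nobody,impala,hive,llama
banned.users=hdfs,yarn,mapred,bin
min.user.id=1000
```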
  • Block Hive CLI Access – This is used to block Hive CLI access for regular users who are not part of groups such as hive and hue.
    • Go to Hive service -> Configuration tab.
    • Locate the hadoop.proxyuser.hive.groups parameter and click the plus sign.
    • Enter hive into the text box and click the plus sign again.
    • Enter hue into the text box, and add sentry in the same way.
    • Click Save Changes.
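The resulting proxy-user setting is equivalent to this core-site.xml fragment (a sketch): with only these groups listed, members of hive, hue, and sentry can go through the Hive proxy, which effectively blocks Hive CLI access for everyone else.

```xml
<property>
  <name>hadoop.proxyuser.hive.groups</name>
  <value>hive,hue,sentry</value>
</property>
```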
  • Here we will configure the Hive service to use Sentry.
    • Go to the Hive service.
    • Click the Configuration tab.
    • Select Scope -> Hive (Service-Wide).
    • Select Category- > Main.
    • Locate the Sentry Service property and select Sentry.
    • If there is a validation error to be fixed, click on the error and check “Enable Stored Notifications in Database”.
    • Click Save Changes to commit the changes.
    • Restart the Hive service.

Note: If the cluster is not Kerberized, make sure to set sentry.hive.testing.mode to true.


This step enables Sentry privileges for the Impala service.

  • Go to the Impala service.
  • Click the Configuration tab.
  • Locate the Sentry Service property and select Sentry.
  • Click Save Changes to commit the changes.
  • Restart the Impala service.

Sentry privileges determine which Hive/Impala databases and tables a user can see or modify from Hue. The user who logs in to Hue must have an equivalent OS-level user account on all hosts so that the user can be authenticated, and that user's OS group must match the group to which privileges are granted.

  • Go to the Hue service.
  • Click the Configuration tab.
  • Select Scope -> Hue (Service-Wide).
  • Select Category -> Main.
  • Locate the Sentry Service property and select Sentry.
  • Click Save Changes to commit the changes.
  • Restart Hue.

Add Sentry Admin Group

We can add a group such that users who are members of it can create roles and grant the corresponding privileges.

  • Go to the Sentry service.
  • Click the Configuration tab.
  • Locate the “Admin Groups” property and add the group (e.g., sentryadmin) whose users will be Sentry admins.
  • Click Save Changes to commit the changes.

Once you are done with the configurations, you can create the roles and privileges for the users.

Creating a user in sentryadmin group

Here we will add the itversity user to the sentryadmin group so that it can create the roles.

Creating Roles and Grant appropriate Permissions

Launch the beeline shell and connect as the Sentry admin user:

beeline
!connect jdbc:hive2://bigdataserver-4.c.smooth-unison-219405.internal:10000/default

Give the username as itversity and that user's password. We are then logged in as the Sentry admin user, who can create roles and privileges.

  • To create an admin role that can access all the databases on the Hive server:

CREATE ROLE admin;
GRANT ALL ON SERVER server1 TO ROLE admin;
GRANT ROLE admin TO GROUP sentryadmin;

  • To create a developers role that has access to a specific database (retail_db in this case):

CREATE ROLE developers;
GRANT ALL ON DATABASE retail_db TO ROLE developers;
GRANT ROLE developers TO GROUP developers;

Once you are done creating the roles, you can log in to the beeline shell as a user who is part of the developers group and check the access.
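To verify from beeline, the admin user can also list the roles and grants. These are standard Sentry statements in HiveQL; the role and group names are the ones created above:

```sql
SHOW ROLES;                        -- all roles the admin can see
SHOW GRANT ROLE developers;        -- privileges granted to the developers role
SHOW ROLE GRANT GROUP developers;  -- roles granted to the developers group
```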

Configure Hue user authorization and authentication

We have already seen how to set up Hue and create users and groups in Hue as part of Install Other Components – Configure Hadoop Ecosystem Components. It contains 2 topics: adding the components to the cluster using Hue, and an overview of Hue groups and users.

Enable or Configure Log and Query Redaction

Handling personally identifiable information such as credit card numbers and passport numbers is very important for enterprises. Cloudera introduced a feature called Sensitive Data Redaction in version 5.4.0 to manage sensitive information in the cluster. For example, Sensitive Data Redaction can keep credit-card numbers out of log files and SQL queries.

In Cloudera Manager there are two parameters in the HDFS configuration: one to enable redaction and one to specify what to redact. Redaction is an HDFS parameter that applies to the whole cluster.

Let us validate how queries are logged into /tmp/itversity/hive.log by running hive queries from the command prompt.

To enable and configure redaction:

  • Go to HDFS -> Configuration
  • Search for “redaction”
  • Check “Enable Log and Query Redaction”
  • Then, to add policies, click the + sign in the “Log and Query Redaction Policy” section to add rules.
  • Each rule has the following fields to configure.
    • Description – Names the rule; this field has no impact on the redaction process.
    • Trigger – A simple string match (not a regular expression); the Search regular expression is applied only to text that contains this trigger string.
    • Search – A regular expression that finds the data to be redacted.
    • Replace – The text that replaces whatever the Search expression matched.
  • To test redaction rules, e.g. to redact email addresses, enter an email address and select “Test Redaction”. The output shows the input with the configured replacement string applied.
  • Click Save Changes to commit the changes.
  • Restart the cluster.

To ensure data is redacted as expected, run Hive queries once again and review the log file.
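The Search/Replace pair behaves like an ordinary regular-expression substitution. As a local sanity check outside the cluster, the same kind of credit-card pattern can be exercised with sed; the pattern and replacement below are illustrative, not Cloudera's built-in defaults:

```shell
# Simulate a redaction rule locally:
#   Search:  four groups of 4 digits separated by non-alphanumeric characters
#   Replace: XXXX-XXXX-XXXX-XXXX
echo "select * from payments where cc = '1234-5678-9012-3456'" \
  | sed -E 's/[0-9]{4}[^a-zA-Z0-9][0-9]{4}[^a-zA-Z0-9][0-9]{4}[^a-zA-Z0-9][0-9]{4}/XXXX-XXXX-XXXX-XXXX/g'
# prints: select * from payments where cc = 'XXXX-XXXX-XXXX-XXXX'
```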

Create Encrypted Zones in HDFS

Let us understand a few details about encryption, how to enable it, and then how to create encrypted zones in HDFS.

  • When we say Data Encryption we are actually talking about business data that is stored in HDFS. This is different from log and query redaction.
  • Data encryption is mandatory for many government, financial, and regulatory entities worldwide to meet privacy and other security requirements.
  • Examples: credit card payment companies must comply with PCI DSS, insurance and other health care companies must comply with HIPAA, etc.
  • Encrypting data stored in HDFS can help your organization comply with such regulations.

Key Capabilities

Let us go through some of the key capabilities of HDFS Encryption.

  • HDFS Clients can Encrypt or Decrypt the data.
  • Encryption and Decryption require a key. Such key management is external to HDFS. We will get the list of Cloudera Provided Key Management Solutions as part of the Wizard.
    • Cloudera Navigator Key Trustee Server
    • A file-based password-protected Java KeyStore
  • We will be using Java KeyStore approach for now.
  • HDFS uses the Advanced Encryption Standard-Counter mode (AES-CTR) encryption algorithm. AES-CTR supports a 128-bit encryption key (default).
  • It also supports 256-bit encryption key when Java Cryptography Extension (JCE) is installed.
  • HDFS Encryption can take the advantage of Hardware Encryption Accelerators such as AES-NI Instruction Set.

Enabling HDFS Encryption Using Wizard

As the curriculum talks only about creating zones, we will take the simplest path to enable encryption and focus on creating encryption zones.

  • Go to the Wizard Cluster -> Set up HDFS Data At Rest Encryption

  • There are several approaches to Enable Encryption. Out of all the options we will choose A file-based password-protected Java KeyStore
  • It will highlight the steps that need to be performed:
    • Enable Kerberos – Recommended
    • Enable TLS/SSL – Recommended
    • Add Java KeyStore KMS Service
    • Restart Stale Services and Redeploy Client Configuration
    • Validate Data Encryption
  • As Kerberos and TLS/SSL are not mandatory, we will not be setting those up for learning purposes at this time. However, in actual production clusters, we need to enable both.
  • We can click on the link Add Java KeyStore KMS Service in steps and add Java KeyStore for now.
    • Add Key Management Server – bigdataserver-4
    • Add Key Admin User and Group – itversity
    • Click on Generate ACLs
    • Click Continue to save generated XML
    • Leave to defaults and Continue
  • Make sure services are restarted and client configurations are redeployed. Ensure that there are no icons to restart or redeploy before going for the next step.

Encryption Zones and Keys

As we have successfully enabled encryption let us go ahead and validate it. As part of the validation we will create encrypted zones as well.

  • An encrypted zone is nothing but a directory in HDFS. These directories can be managed only by the KEY_ADMIN_USER (itversity in our case).
  • Now we can click on the Validate step, which will give instructions to validate. It will look like the image under the gist.
    • We can give any names for key and directory.
    • Create a key (mykey1) and directory (/tmp/zone1) as KEY_ADMIN_USER (itversity)
    • Create a zone and link to the key as Super User (hdfs)
    • Create a file locally and copy to encrypted zone as KEY_ADMIN_USER (itversity)
    • As Super User (hdfs) ensure file is encrypted by looking up into /.reserved/raw/ENCRYPTED_ZONE (/.reserved/raw/tmp/zone1)
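The validation steps above map to commands along these lines. The key and directory names follow the example, and the commands assume a running cluster with the KMS service up:

```shell
# 1. As KEY_ADMIN_USER (itversity): create a key and a target directory
hadoop key create mykey1
hdfs dfs -mkdir /tmp/zone1

# 2. As the HDFS superuser: turn the (empty) directory into an encryption zone
sudo -u hdfs hdfs crypto -createZone -keyName mykey1 -path /tmp/zone1

# 3. As itversity: create a file locally and copy it into the zone
echo "Hello World" > helloWorld.txt
hdfs dfs -put helloWorld.txt /tmp/zone1

# 4. As the superuser: read the raw bytes to confirm they are encrypted
sudo -u hdfs hdfs dfs -cat /.reserved/raw/tmp/zone1/helloWorld.txt
```

Step 4 should print ciphertext rather than the original text, since the superuser reads the raw stored bytes without going through decryption.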

As part of the validation, we have created keys and zones. Let us understand the concepts behind those.

  • Encryption Zone is a directory in HDFS whose contents will be automatically encrypted on write and decrypted on read.
  • Only users who have access to the key will be able to decrypt; others cannot. In this case only itversity will be able to read the contents of /tmp/zone1/helloWorld.txt
  • Encryption zones start off as empty directories. If we have to encrypt data in bulk, we can use tools such as distcp.
  • Each encryption zone is associated with a key (EZ Key) specified by the key administrator when the zone is created.
  • Each file within an encryption zone has its own encryption key, called the Data Encryption Key (DEK).
  • These DEKs are encrypted with their respective encryption zone’s EZ key, to form an Encrypted Data Encryption Key (EDEK).
  • EDEKs are stored persistently on the NameNode as part of each file’s metadata, using HDFS extended attributes.
  • Extended attributes are key/value pairs in which the values are optional; generally, the key and value sizes are limited to some implementation-specific limit. Without extended attributes, only fixed attributes such as file size, permissions, and modification dates are stored as part of a file’s metadata.
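Extended attributes can be inspected with the -getfattr option of hdfs dfs. As a hedged example, the encryption-related attributes live in the raw namespace, so the superuser can view them through the /.reserved/raw path; the exact attribute names shown may vary by CDH version:

```shell
# List extended attribute names and values on a file in the zone
sudo -u hdfs hdfs dfs -getfattr -d /.reserved/raw/tmp/zone1/helloWorld.txt
```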

You can go to this detailed blog to understand how encryption actually works.