CCA-175 (28-08-21) failed

Hi,
This Friday I took the exam and failed :frowning:
BUT! I was able to copy-paste all the exam questions:

You're welcome :wink:

CCA175: Spark and Hadoop Developer

You are logged into a single node cluster as the cert user. For each scenario below, the Output Requirements describe the criteria that will be used to grade the exam. Spark 2.4 is installed on the cluster.

Copy and Paste

It is recommended to use the drop-down menu in the browser and the terminal to copy and paste. Copy/Paste keystrokes may cause repeating key problems on the remote desktop.

Desktop

There is a Change Resolution icon on the desktop for changing screen size. The Pluma icon launches a text editor.

Documentation

The Firefox browser has bookmarks to documentation. You may not use any other documentation or resources.
WARNING: The use of any other website will result in an exam failure. In particular using Google search will cause the exam to be marked as a failure.


Problem 1

Calculate how many customers live in each state of the country.

Data Description

Customer records are stored in the HDFS directory /user/cert/problem1/data/customer/ in a tab-delimited text format.

The files contain the following columns and types:

Column Type
id int
fname string
lname string
address string
city string
state string
zip string

Output Requirements

  • Place the result data in the HDFS directory /user/cert/problem1/solution/
  • Use a text format with a tab as the column delimiter
  • The first column should be the state, and the second should be the total of all customers that live in that state

Sample Results

ME 288165
MI 877860
SC 398638


Problem 2

Retrieve billing records that have a large charge associated with them and store those records as compressed Parquet files.

Data Description

There are billing records stored in a metastore table billing in the problem2 database.

Column Type
id int
charge float
code string
tstamp string

Output Requirements

  • Place the result data in the HDFS directory /user/cert/problem2/solution/
  • Only retrieve billing records that have a charge greater than $10.00
  • The files should use the Parquet file format with gzip compression
  • The schema of the Parquet file should be the same as the input metastore table

Sample Results

5103830 19.41 X1 03/29/15 18:39:34
5456102 13.77 GA18 03/29/15 18:40:33
5481343 10.33 BB2 03/29/15 18:40:21


Problem 3

Convert customer data into a new file format to improve query performance.

Data Description

Customer records are stored in the HDFS directory
/user/cert/problem3/data/customer/

Output Requirements

  • Place the result data in the HDFS directory /user/cert/problem3/solution/
  • The data should be stored in Snappy compressed Parquet files
  • The output should contain the following columns
Column Type
id string
fname string
lname string
street string
city string
state string
zip string


Problem 4

For security purposes, your company needs to refer to employees on particular forms without using their full name. Create a new dataset that stores an employee alias.

Data Description

Employee records are stored in the HDFS directory
/user/cert/problem4/data/employee/ in a Parquet file format.

Output Requirements

  • Place the result data in the HDFS directory /user/cert/problem4/solution/
  • The data should be stored in Snappy compressed Parquet files
  • Create a column called "alias" by taking the first letter of the first name and appending the last name
  • Write out the employee id, first name, last name, and alias

Sample Results

7499998 Brittany Hewitt BHewitt
7499999 Carol Vazquez-Santana CVazquez-Santana
7500000 Matthew Holmes MHolmes


Problem 5

Convert existing customer JSON files into a compressed Avro file format.

Data Description

There are 25 million records stored in the HDFS directory
/user/cert/problem5/data/customer/ in the JSON file format. Each record contains fourteen columns.

Output Requirements

  • Place the result files in the HDFS directory /user/cert/problem5/solution/
  • The solution files should use the Avro file format with Snappy compression
  • The schema of the Avro records should be the same as the input JSON files


Problem 6

Loudacre Mobile is concerned about phones overheating. Sensor data is captured from each device in the network on a regular basis. Find the average temperature for each model of phone.

Data Description

Phone sensor records are stored in the HDFS directory
/user/cert/problem6/data/sensor/

Column Type
Timestamp int
Customer ID int
Phone ID string
Phone Model string
Latitude float
Longitude float
Firmware Version int
Bluetooth Status int
GPS Status int
WiFi Status int
Battery Remaining float
Temperature float
Signal Strength float

Output Requirements

  • Place the result data in the HDFS directory /user/cert/problem6/solution/
  • The report should be in comma-delimited text format
  • Store the phone model and average temperature

Sample Results

SMGRZ 41.3556451744487
VV0N7 44.2236067977499
NKIAM20 46.1234567890123


Problem 7

Generate a report of all customers sorted by last name.

Data Description

Customer records are stored in the HDFS directory
/user/cert/problem7/data/customer/ in a Parquet file format.

Output Requirements

  • Place the result data in the HDFS directory /user/cert/problem7/solution/
  • Use an ORC file format with Snappy compression for the output files
  • The output should contain the customer's last name and the customer's first name
  • Sort by last names; it is not necessary to sort identical last names secondarily by first name

Sample Results

Abbott Cindy
Baker Mark
Cortez Ryan


Problem 8

Billing data needs to be converted into a high-performance file format and stored in a metastore table so that your coworkers can run efficient Impala queries against it.

Data Description

Billing records are stored in the HDFS directory
/user/cert/problem8/data/billing/ as tab-delimited text files.

Output Requirements

  • Create a metastore table named solution in the problem8 database
  • The data should be stored as Snappy compressed Parquet files
  • The table schema should be
Column Type
id int
charge double
code string
time string


Problem 9

Create a report of bills owed by customers.

Data Description

Customer records are stored in the HDFS directory
/user/cert/problem9/data/customer/ in a tab-delimited text format.

Column
id
fname
lname
address
city
state
zip

Billing records are stored in the HDFS directory /user/cert/problem9/data/billing/ in a tab-delimited text format. The custid field is a foreign key to the customer that owns the bill.

Column
custid
amount
code
billdate

Output Requirements

  • Place the result data in the HDFS directory /user/cert/problem9/solution/
  • Use a text format with a tab as the column delimiter
  • The first column should be the customer's full name (first, space, last)
  • The second column should be the amount of money of a single billing transaction

Sample Results

Gwendolyn Ware 0.37
Juan Gibson 2.00
James Perez 1.69

