Hi,
This Friday I took the exam and failed…
BUT! I was able to copy and paste all of the exam questions.
You're welcome.
CCA175: Spark and Hadoop Developer
You are logged into a single-node cluster as the cert user. For each scenario below, the Output Requirements describe the criteria that will be used to grade the exam. Spark 2.4 is installed on the cluster.
Copy and Paste
It is recommended to use the drop-down menu in the browser and the terminal to copy and paste. Copy/Paste keystrokes may cause repeating key problems on the remote desktop.
Desktop
There is a Change Resolution icon on the desktop for changing screen size. The Pluma icon launches a text editor.
Documentation
The Firefox browser has bookmarks to documentation. You may not use any other documentation or resources.
WARNING: The use of any other website will result in exam failure. In particular, using Google search will cause the exam to be marked as a failure.
Problem 1
Calculate how many customers live in each state of the country.
Data Description
Customer records are stored in the HDFS directory
/user/cert/problem1/data/customer/
in a tab-delimited text format.
The files contain the following columns and types:
Column | Type |
---|---|
id | int |
fname | string |
lname | string |
address | string |
city | string |
state | string |
zip | string |
Output Requirements
- Place the result data in the HDFS directory
/user/cert/problem1/solution/
- Use a text format with a tab as the column delimiter
- The first column should be the state, and the second should be the total of all customers that live in that state
Sample Results
ME | 288165 |
MI | 877860 |
SC | 398638 |
Problem 2
Retrieve billing records that have a large charge associated with them and store those records as compressed Parquet files.
Data Description
Billing records are stored in a metastore table named billing in the problem2 database.
Column | Type |
---|---|
id | int |
charge | float |
code | string |
tstamp | string |
Output Requirements
- Place the result data in the HDFS directory
/user/cert/problem2/solution/
- Only retrieve billing records that have a charge greater than $10.00
- The files should use the Parquet file format with gzip compression
- The schema of the Parquet file should be the same as the input metastore table
Sample Results
5103830 | 19.41 | X1 | 03/29/15 18:39:34 |
5456102 | 13.77 | GA18 | 03/29/15 18:40:33 |
5481343 | 10.33 | BB2 | 03/29/15 18:40:21 |
Problem 3
Convert customer data into a new file format to improve query performance.
Data Description
Customer records are stored in the HDFS directory
/user/cert/problem3/data/customer/
Output Requirements
- Place the result data in the HDFS directory
/user/cert/problem3/solution/
- The data should be stored in Snappy compressed Parquet files
- The output should contain the following columns
Column | Type |
---|---|
id | string |
fname | string |
lname | string |
street | string |
city | string |
state | string |
zip | string |
Problem 4
For security purposes, your company needs to refer to employees on particular forms without using their full name. Create a new dataset that stores an employee alias.
Data Description
Employee records are stored in the HDFS directory
/user/cert/problem4/data/employee/
in a Parquet file format.
Output Requirements
- Place the result data in the HDFS directory
/user/cert/problem4/solution/
- The data should be stored in Snappy compressed Parquet files
- Create a column called "alias" by taking the first letter of the first name and appending the last name
- Write out the employee id, first name, last name, and alias
Sample Results
7499998 | Brittany | Hewitt | BHewitt |
7499999 | Carol | Vazquez-Santana | CVazquez-Santana |
7500000 | Matthew | Holmes | MHolmes |
Problem 5
Convert existing customer JSON files into a compressed Avro file format.
Data Description
There are 25 million records stored in the HDFS directory
/user/cert/problem5/data/customer/
in the JSON file format. Each record contains fourteen columns.
Output Requirements
- Place the result files in the HDFS directory
/user/cert/problem5/solution/
- The solution files should use the Avro file format with Snappy compression
- The schema of the Avro records should be the same as the input JSON files
Problem 6
Loudacre Mobile is concerned about phones overheating. Sensor data is captured from each device in the network on a regular basis. Find the average temperature for each model of phone.
Data Description
Phone sensor records are stored in the HDFS directory
/user/cert/problem6/data/sensor/
Column | Type |
---|---|
Timestamp | int |
Customer ID | int |
Phone ID | string |
Phone Model | string |
Latitude | float |
Longitude | float |
Firmware Version | int |
Bluetooth Status | int |
GPS Status | int |
WiFi Status | int |
Battery Remaining | float |
Temperature | float |
Signal Strength | float |
Output Requirements
- Place the result data in the HDFS directory
/user/cert/problem6/solution/
- The report should be in comma-delimited text format
- Store the phone model and average temperature
Sample Results
SMGRZ | 41.3556451744487 |
VV0N7 | 44.2236067977499 |
NKIAM20 | 46.1234567890123 |
Problem 7
Generate a report of all customers sorted by last name.
Data Description
Customer records are stored in the HDFS directory
/user/cert/problem7/data/customer/
in a Parquet file format.
Output Requirements
- Place the result data in the HDFS directory
/user/cert/problem7/solution/
- Use an ORC file format with Snappy compression for the output files
- The output should contain the customer's last name and the customer's first name
- Sort by last names; it is not necessary to sort identical last names secondarily by first name
Sample Results
Abbott | Cindy |
Baker | Mark |
Cortez | Ryan |
Problem 8
Billing data needs to be converted to a high-performance file format and stored in a metastore table so that your coworkers can run efficient Impala queries against it.
Data Description
Billing records are stored in the HDFS directory
/user/cert/problem8/data/billing/
as tab-delimited text files.
Output Requirements
- Create a metastore table named solution in the problem8 database
- The data should be stored as Snappy compressed Parquet files
- The table schema should be
Column | Type |
---|---|
id | int |
charge | double |
code | string |
time | string |
Problem 9
Create a report of bills owed by customers.
Data Description
Customer records are stored in the HDFS directory
/user/cert/problem9/data/customer/
in a tab-delimited text format.
Column |
---|
id |
fname |
lname |
address |
city |
state |
zip |
Billing records are stored in the HDFS directory
/user/cert/problem9/data/billing/
in a tab-delimited text format.
The custid field is a foreign key to the customer that owns the bill.
Column |
---|
custid |
amount |
code |
billdate |
Output Requirements
- Place the result data in the HDFS directory
/user/cert/problem9/solution/
- Use a text format with a tab as the column delimiter
- The first column should be the customer's full name (first, space, last)
- The second column should be the amount of money of a single billing transaction
Sample Results
Gwendolyn Ware | 0.37 |
Juan Gibson | 2.00 |
James Perez | 1.69 |