Data Engineering Spark SQL - Spark SQL Functions - Query Example - Word Count

Let us see how we can perform word count using Spark SQL. Using word count as an example, we will understand how to come up with a solution using the pre-defined functions available in Spark SQL.

Explanation: The YouTube video linked at the end of this article complements the text by visually demonstrating the steps involved in performing word count using Spark SQL.

Key Concepts Explanation

Spark Context Setup

To begin with, set up the Spark session in the Notebook using the code snippet provided, so that the subsequent code can be executed.

val username = System.getProperty("user.name")
import org.apache.spark.sql.SparkSession

// Create (or reuse) a SparkSession on YARN with Hive support,
// pointing the warehouse directory to the current user's HDFS location.
val spark = SparkSession.
    builder.
    config("spark.ui.port", "0").
    config("spark.sql.warehouse.dir", s"/user/${username}/warehouse").
    enableHiveSupport.
    appName(s"${username} | Spark SQL - Predefined Functions").
    master("yarn").
    getOrCreate

Using Spark SQL, Scala, and PySpark

Alternatively, launch a Spark SQL, Scala, or PySpark session from the command line using the respective commands below:

Using Spark SQL

spark2-sql \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse

Using Scala

spark2-shell \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse

Using PySpark

pyspark2 \
    --master yarn \
    --conf spark.ui.port=0 \
    --conf spark.sql.warehouse.dir=/user/${USER}/warehouse

Hands-On Tasks

Perform the following hands-on tasks to practice word count using Spark SQL; a sketch that puts them together follows the list:

  1. Create a table named lines.
  2. Insert data into the table.
  3. Split lines into an array of words.
  4. Explode the array of words from each line into individual records.
  5. Use group by to get the count of each word.
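The statements below are a minimal sketch of these tasks in Spark SQL. The table name lines comes from the task list; the column name s and the sample sentences are assumptions made purely for illustration. You can run the statements in the spark2-sql session launched above, or wrap each one in spark.sql("...") from the Scala or PySpark shell.

-- 1. Create a table named lines (a single STRING column s, assumed for illustration)
CREATE TABLE IF NOT EXISTS lines (s STRING);

-- 2. Insert data into the table (sample sentences, illustrative only)
INSERT INTO lines VALUES
    ('Hello World'),
    ('How are you'),
    ('Let us perform word count using Spark SQL');

-- 3. Split each line into an array of words,
-- 4. explode the array into one record per word, and
-- 5. group by word to get the count of each word
SELECT word, count(1) AS word_count
FROM (
    SELECT explode(split(s, ' ')) AS word
    FROM lines
) AS words
GROUP BY word
ORDER BY word_count DESC;

Here, split converts each line into an array of words using the space delimiter, explode generates one record per array element, and the GROUP BY aggregation produces the count for each word.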

Conclusion

In conclusion, you have learned how to perform word count using Spark SQL. We encourage you to practice these concepts and engage with the community for further learning.

Watch the video tutorial here