Getting Started with Spark SQL

This article provides a step-by-step guide on getting started with Spark SQL. The accompanying video explains the concepts covered in the article.

Video Placeholder Link

Key Concepts Explanation

Spark SQL

Spark SQL is a Spark module for structured data processing. It provides a programming abstraction called DataFrames and can also act as a distributed SQL query engine.

import org.apache.spark.sql.SparkSession

// Build (or reuse) a SparkSession, the entry point for Spark SQL
val sparkSession = SparkSession.builder()
  .appName("Spark SQL Example")
  .getOrCreate()
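
Because the session also acts as a distributed SQL query engine, SQL statements can be run through it directly. A minimal sketch, assuming the session created above; the result of the query comes back as a DataFrame:

// Run a SQL statement through the session and print the result
sparkSession.sql("SHOW DATABASES").show()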

DataFrames

DataFrames are distributed collections of data organized into named columns, conceptually similar to tables in a relational database. Operations on DataFrames run through Spark's query optimizer, which makes working with structured data both concise and efficient.

// Read a JSON Lines file (one JSON object per line) into a DataFrame
val df = sparkSession.read.json("people.json")

// Print the first rows of the DataFrame
df.show()
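
The loaded data can then be queried with DataFrame methods. A small sketch, assuming people.json follows the usual Spark sample layout with name and age columns (those column names are an assumption, not part of this exercise):

// Project a single column (column name assumed from the sample file)
df.select("name").show()

// Filter rows using a column expression
df.filter(df("age") > 21).show()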

Hands-On Tasks

  1. Launch Spark SQL using the provided script or the Spark shell (a sketch of all six steps follows this list).
  2. Create a database, prefixing the database name with your OS username, and exit.
  3. Reconnect and switch to your database.
  4. Create a table named ‘orders’ using the provided script.
  5. List the tables in your database.
  6. Describe the ‘orders’ table to review its metadata.
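
Since task 1 allows the Spark shell, the same steps can be driven from Scala through the session created earlier. This is only a sketch: the database name itversity_retail is a placeholder (substitute your own OS username), and the column list for ‘orders’ is illustrative, so rely on the provided script for the actual definition.

// Task 2: create a database prefixed with your OS username (placeholder shown)
sparkSession.sql("CREATE DATABASE IF NOT EXISTS itversity_retail")

// Task 3: switch to your database after reconnecting
sparkSession.sql("USE itversity_retail")

// Task 4: create the orders table (illustrative columns; use the provided script)
sparkSession.sql("""
  CREATE TABLE IF NOT EXISTS orders (
    order_id INT,
    order_date STRING,
    order_customer_id INT,
    order_status STRING
  )
""")

// Task 5: list the tables in the current database
sparkSession.sql("SHOW TABLES").show()

// Task 6: review the table's metadata
sparkSession.sql("DESCRIBE orders").show(false)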

Conclusion

This article covered the basics of getting started with Spark SQL. Practice the tasks above to solidify your understanding, and consider engaging with the community as you continue learning.
