Data Engineering using Spark SQL - Getting Started - Overview of Spark SQL Properties

Let us understand the details of Spark SQL properties, which control the Spark SQL runtime environment.

  • Spark SQL inherits the properties defined for Spark. There are some Spark SQL specific properties as well, and these apply to Data Frames too.

  • We can review these properties using management tools such as the Ambari or Cloudera Manager web UI.

  • In clusters where Spark is integrated with Hadoop and Hive, Spark runtime behavior is also controlled by the HDFS, YARN, and Hive properties files.

  • We can review the properties using the SET; command in the Spark SQL CLI, as shown below.
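
As a quick illustration, running SET; inside the Spark SQL CLI prints properties as key-value pairs. The output below is trimmed and the values are only illustrative; they will differ in your environment:

spark-sql> SET;
spark.sql.catalogImplementation	hive
spark.sql.warehouse.dir	/user/hive/warehouse
...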

Let us review some important properties in Spark SQL.

  • spark.sql.warehouse.dir
  • spark.sql.catalogImplementation

spark.sql.warehouse.dir Property

The spark.sql.warehouse.dir property specifies the location of the warehouse directory in a Spark SQL environment. This is where Spark SQL creates databases and tables unless an explicit location is provided.

To review the current value of this property, you can use the following command in Spark SQL CLI:

SET spark.sql.warehouse.dir;
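
Since the warehouse location is generally fixed for the lifetime of a session, it is typically passed at launch time rather than changed afterwards. A minimal sketch, assuming a hypothetical warehouse path:

spark-sql --conf spark.sql.warehouse.dir=/user/itversity/warehouse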

spark.sql.catalogImplementation Property

The spark.sql.catalogImplementation property determines the implementation used to store metadata in Spark SQL. It can be set to in-memory or hive.

To review the current value of this property, you can use the following command in Spark SQL CLI:

SET spark.sql.catalogImplementation;
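
Note that spark.sql.catalogImplementation is a static configuration; it cannot be modified with SET in a running session and has to be supplied when the application is launched. A minimal sketch:

spark-sql --conf spark.sql.catalogImplementation=in-memory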

Overriding Properties

Properties that are still at their default values may not show up in the SET; output. You can override a property by assigning a new value using the same SET command. For example, you can change the spark.sql.shuffle.partitions property, which controls the number of partitions used for shuffles, to 2 as follows:

SET spark.sql.shuffle.partitions=2;
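
To see the effect, run an aggregation after changing the property; the shuffle stage of the query will then use two partitions. A minimal sketch against a hypothetical orders table:

SET spark.sql.shuffle.partitions=2;
SELECT order_status, count(1) AS order_count
FROM orders
GROUP BY order_status;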

Hands-On Tasks

  1. Open Spark SQL CLI and review the values of spark.sql.warehouse.dir and spark.sql.catalogImplementation properties.
  2. Experiment with changing the value of a property, such as spark.sql.shuffle.partitions, using the SET command.

Conclusion

In this article, we discussed important Spark SQL properties that control the runtime behavior of the Spark SQL environment. Understanding and managing these properties effectively is essential for optimal performance. Experiment with different properties and values to customize the behavior of Spark SQL according to your requirements. Practice more and engage with the community for further learning and insights.
