Introduction to Hadoop Ecosystem - Overview of HDFS - HDFS Block Size

In this article, we will take a close look at block size in HDFS. HDFS, the Hadoop Distributed File System, stores large files across multiple nodes in a distributed fashion. We will cover the default block size, its impact on file storage, and the configuration settings that control block size in HDFS.

Key Concepts Explanation

Default Block Size

The default block size in HDFS is 128 MB, which means that a file larger than this size will be divided into multiple blocks and stored across different nodes in the Hadoop cluster. Let’s take a closer look at how the default block size affects file storage in HDFS.

hdfs dfs -ls -h /public/randomtextwriter/part-m-00000
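
To confirm the value in effect on your cluster, you can query the active configuration (a quick check, assuming the hdfs command is on your PATH); it prints the block size in bytes, 134217728 for 128 MB:

hdfs getconf -confKey dfs.blocksize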

Configuration in hdfs-site.xml

The block size is configured in the hdfs-site.xml file through the dfs.blocksize property, which determines the size of the blocks into which files are split when they are stored in HDFS.

cat /etc/hadoop/conf/hdfs-site.xml
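
For reference, the relevant entry in hdfs-site.xml typically looks like the snippet below (134217728 bytes is 128 MB; the value on your cluster may differ):

<property>
  <name>dfs.blocksize</name>
  <value>134217728</value>
</property>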

File Block Allocation

For a file smaller than the default block size, only one block will be allocated to store the entire file. Let’s examine the block allocation for a file of size 2.9 MB in HDFS.

ls -lhtr /data/retail_db/orders/part-00000
hdfs fsck /user/${USER}/retail_db/orders/part-00000 -files -blocks -locations
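
You can also read the block size recorded for an individual file with the stat command; the %o format specifier prints the block size in bytes (path reused from the example above):

hdfs dfs -stat %o /user/${USER}/retail_db/orders/part-00000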

Impact on File Storage

When storing a large file such as yelp_academic_dataset_user.json, which is 2.4 GB, HDFS divides it into 19 blocks: all but the last block are 128 MB, and the final block holds the remainder. Let’s analyze the block allocation for this file in HDFS.

ls -lhtr /data/yelp-dataset-json/yelp_academic_dataset_user.json
hdfs fsck /public/yelp-dataset-json/yelp_academic_dataset_user.json -files -blocks -locations
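
As a rough sanity check, the expected block count can be estimated by dividing the file size by the block size and rounding up. The sketch below assumes the default 128 MB block size and a Linux system with a local copy of the file at the path shown above:

BLOCK_SIZE=$((128 * 1024 * 1024))
FILE_SIZE=$(stat -c %s /data/yelp-dataset-json/yelp_academic_dataset_user.json)
echo $(( (FILE_SIZE + BLOCK_SIZE - 1) / BLOCK_SIZE ))  # ceiling division gives the expected number of blocks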

Hands-On Tasks

  1. Check the block allocation for a file in HDFS using the hdfs fsck command.
  2. Experiment with different file sizes and block size settings to understand how block size impacts file storage in HDFS (see the sketch after this list).
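
The following sketch covers both tasks; the destination directory and the 64 MB (67108864 bytes) override are illustrative, and overriding dfs.blocksize per write is optional if you only want to inspect the default allocation:

hdfs dfs -mkdir -p /user/${USER}/blocksize_demo
hdfs dfs -D dfs.blocksize=67108864 -put /data/retail_db/orders/part-00000 /user/${USER}/blocksize_demo/
hdfs fsck /user/${USER}/blocksize_demo/part-00000 -files -blocks -locations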

Conclusion

Understanding block size in HDFS is crucial for efficient file storage and retrieval in a Hadoop cluster. By configuring the block size appropriately and analyzing how files are split into blocks, you can optimize both storage and performance. Practice exploring block size configurations and file storage in HDFS to deepen your understanding of this key concept.
