Apache Spark Partitions – HDFS

Reading data from HDFS

When Apache Spark reads data from the Hadoop Distributed File System (HDFS), the way partitions are created is influenced by several factors. Here’s an overview of what happens during an HDFS read.

HDFS Blocks

The basic storage unit in HDFS is a block. The default block size is 128 MB (256 MB is a common override), and it can be changed through the dfs.blocksize setting. The blocks of a file are distributed across the DataNodes of the Hadoop cluster and replicated for fault tolerance.
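
A quick way to confirm the block size a given cluster uses is to read the client-side Hadoop configuration that Spark carries. This is a minimal sketch; the value may be missing or formatted differently depending on how the cluster is configured.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("hdfs-block-size").getOrCreate()

    // dfs.blocksize is the standard HDFS setting for the block size.
    // It can be null when the client configuration does not set it explicitly.
    val blockSize = Option(spark.sparkContext.hadoopConfiguration.get("dfs.blocksize"))
    println(s"Configured HDFS block size: ${blockSize.getOrElse("not set on the client")}")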

InputSplit

In Spark, reading data from HDFS involves creating InputSplits. An InputSplit is a logical representation of a portion of the data that Spark can process in parallel, and for splittable formats each InputSplit typically corresponds to a single HDFS block.

Number of Partitions

The number of partitions created by Spark matches the number of InputSplits, which is determined by the number of HDFS blocks in the input files: each InputSplit becomes one partition.
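
This is easy to check with the RDD API. The sketch below assumes a hypothetical HDFS path to a splittable text file.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("partition-count").getOrCreate()

    // Hypothetical path to a splittable text file stored in HDFS.
    val rdd = spark.sparkContext.textFile("hdfs:///data/events/events.log")

    // For a plain text file read this way, the count typically matches the number
    // of HDFS blocks backing the file (one partition per InputSplit).
    println(s"Partitions: ${rdd.getNumPartitions}")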

Parallelism

The number of partitions determines the level of parallelism for the read: Spark schedules one task per partition, so more partitions allow more tasks to run concurrently, up to the number of available executor cores.

Configuration Parameters

Spark provides configuration parameters that allow users to control the partitioning behavior.

spark.sql.files.maxPartitionBytes caps the number of bytes packed into a single partition when reading files (128 MB by default), and spark.sql.files.openCostInBytes is the estimated cost, in bytes, of opening a file, used when packing multiple small files into one partition (4 MB by default). Together they influence how many partitions Spark plans for a file-based read.

For example, these settings can be applied when building the SparkSession. In the sketch below the values and the HDFS path are illustrative, not recommendations:
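
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("partition-tuning")
      // Cap each read partition at roughly 64 MB (the default is 128 MB).
      .config("spark.sql.files.maxPartitionBytes", "64MB")
      // Treat each opened file as costing 8 MB when packing small files (the default is 4 MB).
      .config("spark.sql.files.openCostInBytes", "8MB")
      .getOrCreate()

    // Hypothetical HDFS directory; the reader applies the settings above when planning partitions.
    val df = spark.read.parquet("hdfs:///data/sales/2024")
    println(s"Partitions after read: ${df.rdd.getNumPartitions}")

Lowering spark.sql.files.maxPartitionBytes generally yields more, smaller partitions for large files, while raising spark.sql.files.openCostInBytes makes Spark pack fewer small files into each partition.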

Users can also explicitly change the number of partitions after reading the data: repartition shuffles to any target count, while coalesce reduces the count without a full shuffle.
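
A minimal sketch of both methods, assuming a hypothetical HDFS path and arbitrary target counts:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("repartition-example").getOrCreate()

    // Hypothetical HDFS directory.
    val df = spark.read.parquet("hdfs:///data/sales/2024")

    // repartition performs a full shuffle and can increase or decrease the partition count.
    val wide = df.repartition(200)

    // coalesce reduces the partition count without a shuffle, e.g. before writing fewer output files.
    val narrow = wide.coalesce(20)

    println(s"after repartition: ${wide.rdd.getNumPartitions}, after coalesce: ${narrow.rdd.getNumPartitions}")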

It’s important to note that the effective partitioning also depends on the characteristics of the data, the number of nodes in the Spark cluster, and other relevant configuration parameters.

Adjustments to the default configurations may be necessary based on the specifics of the use case and the cluster environment.

Here is an example illustrating how Spark might create partitions when reading data from HDFS:

  • If a file in HDFS is 512 MB and the default HDFS block size is 128 MB, Spark will create four InputSplits (partitions) for this file, one per 128 MB block (a simplified version of this calculation is sketched after the list).
  • If multiple files are being read in parallel, Spark will create partitions for each file, and these partitions can be processed concurrently.
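
As a rough sanity check on the first bullet, the sketch below loosely mirrors the calculation Spark SQL performs when sizing partitions for file-based DataFrame reads; the parallelism value is assumed, and the real logic lives in Spark’s file-source planning code.

    // Simplified sketch with assumed values, not a faithful copy of Spark's implementation.
    val maxPartitionBytes  = 128L * 1024 * 1024   // spark.sql.files.maxPartitionBytes default
    val openCostInBytes    = 4L * 1024 * 1024     // spark.sql.files.openCostInBytes default
    val defaultParallelism = 4                    // assumed number of available cores
    val fileSizeBytes      = 512L * 1024 * 1024   // the 512 MB file from the example above

    val bytesPerCore  = (fileSizeBytes + openCostInBytes) / defaultParallelism
    val maxSplitBytes = math.min(maxPartitionBytes, math.max(openCostInBytes, bytesPerCore))
    val partitions    = math.ceil(fileSizeBytes.toDouble / maxSplitBytes).toInt

    println(s"split size: $maxSplitBytes bytes, estimated partitions: $partitions")  // 4 under these assumptions

With a larger parallelism value, bytesPerCore shrinks and the same file can be cut into more, smaller partitions, which is one reason partition counts differ between clusters.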

Key Takeaways:

  • Spark strives for one partition per HDFS block by default.
  • Configuration options can adjust partition count.
  • How partitions are determined depends on the file format (splittable vs. non-splittable) and on configuration.
  • Data is spread roughly evenly across partitions, though skewed input can still produce uneven partitions.
  • File size, format, cluster setup, and Spark settings impact partition size.