
AWS Glue for Scalable ETL Workflows

Introduction

As businesses shift to cloud-based data ecosystems, the need for scalable and efficient ETL (Extract, Transform, Load) processes becomes critical. AWS Glue, a serverless data integration service, simplifies big data processing by automating schema discovery, ETL transformations, and data cataloging. This blog explores AWS Glue’s capabilities, best practices, and use cases to help…

Read More
Apache Spark Partitions – AWS S3

Reading data from AWS S3

When Apache Spark reads data from Amazon S3 (Simple Storage Service), the process of creating partitions differs from reading data from HDFS. In the case of S3, Spark does not align partitions with HDFS blocks, because there is no block-based storage system like HDFS in…
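The sizing logic the excerpt alludes to can be sketched numerically. For file sources such as S3, Spark computes a split size from `spark.sql.files.maxPartitionBytes` (128 MB by default), `spark.sql.files.openCostInBytes` (4 MB by default), and the default parallelism. A minimal Python approximation, assuming those defaults (the function names here are illustrative, not Spark APIs):

```python
import math

def max_split_bytes(total_bytes, file_count,
                    max_partition_bytes=128 * 1024 * 1024,  # spark.sql.files.maxPartitionBytes
                    open_cost=4 * 1024 * 1024,              # spark.sql.files.openCostInBytes
                    default_parallelism=8):
    """Approximate the split size Spark uses for file-based sources (e.g. S3)."""
    # Each file is "padded" by the open cost, then work is spread across cores.
    bytes_per_core = (total_bytes + file_count * open_cost) // default_parallelism
    return min(max_partition_bytes, max(open_cost, bytes_per_core))

def estimated_partitions(total_bytes, file_count, **kwargs):
    """Rough partition count: total data divided by the computed split size."""
    return math.ceil(total_bytes / max_split_bytes(total_bytes, file_count, **kwargs))
```

For example, 1 GiB spread over 8 files with the defaults above yields a 128 MB split size and therefore about 8 partitions.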

Read More
Apache Spark Partitions – HDFS

Reading data from HDFS

When Apache Spark reads data from the Hadoop Distributed File System (HDFS), partition creation is influenced by several factors. Here’s an overview of how Spark creates partitions when reading data from HDFS.

HDFS Blocks

The primary storage unit in HDFS is a block. By default, these blocks are commonly…
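The block-based behavior described above can be sketched with simple arithmetic: with RDD APIs such as `sc.textFile`, Spark typically creates roughly one partition per HDFS block (InputSplit). A minimal Python estimate, assuming the common 128 MB default for `dfs.blocksize` (the helper name is illustrative):

```python
import math

HDFS_BLOCK_SIZE = 128 * 1024 * 1024  # common default for dfs.blocksize

def hdfs_blocks(file_size_bytes, block_size=HDFS_BLOCK_SIZE):
    """Number of HDFS blocks a file occupies; Spark's RDD reads
    typically produce roughly one partition per block."""
    return max(1, math.ceil(file_size_bytes / block_size))
```

For instance, a 1 GiB file spans 8 blocks, so an RDD read would start with about 8 partitions; a 300 MB file spans 3.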

Read More