Tag: AWS

Apache Spark Partitions – AWS S3

Reading data from AWS S3: When Apache Spark reads data from Amazon S3 (Simple Storage Service), the process of creating partitions differs from reading data from HDFS. In the case of S3, Spark does not directly align partitions with the concept of HDFS blocks, as there is no block-based storage system like HDFS in S3. …
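For S3 reads through Spark SQL file sources, partition sizing is driven by configuration rather than storage blocks, notably `spark.sql.files.maxPartitionBytes` (default 128 MB) and `spark.sql.files.openCostInBytes` (default 4 MB). The pure-Python sketch below mirrors Spark's split-size heuristic; the config names are real Spark settings, but the helper function itself is an illustration, not Spark's actual code:

```python
# Sketch of how Spark SQL sizes read partitions for file sources such as S3.
# The formula mirrors Spark's FilePartition packing heuristic; the function
# name and signature are illustrative.

def max_split_bytes(total_bytes: int,
                    default_parallelism: int,
                    max_partition_bytes: int = 128 * 1024 * 1024,
                    open_cost_in_bytes: int = 4 * 1024 * 1024) -> int:
    """Target size of each read partition, per Spark's packing heuristic."""
    bytes_per_core = total_bytes // default_parallelism
    return min(max_partition_bytes, max(open_cost_in_bytes, bytes_per_core))

# Example: a 1 GiB dataset read on a cluster with 8 cores.
total = 1024 * 1024 * 1024
split = max_split_bytes(total, default_parallelism=8)
num_partitions = -(-total // split)  # ceiling division
print(split // (1024 * 1024), "MB per split ->", num_partitions, "partitions")
```

With these defaults, the 1 GiB example lands on 128 MB splits, so the read produces 8 partitions; lowering `maxPartitionBytes` yields more, smaller partitions.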

Apache Spark Partitions – HDFS

Reading data from HDFS: When Apache Spark reads data from the Hadoop Distributed File System (HDFS), the process of creating partitions is influenced by several factors. Here's an overview of how Spark creates partitions when reading data from HDFS. HDFS blocks: the primary storage unit in HDFS is a block. By default, these blocks are commonly 128 MB in size. …
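When reading HDFS, Spark inherits Hadoop's `FileInputFormat` split rule, which typically yields one partition per block: `splitSize = max(minSize, min(maxSize, blockSize))`. A minimal sketch, assuming the 128 MB default block size (the function names are illustrative, and Hadoop's real implementation also applies a small "slop" factor before creating a final short split):

```python
# Illustrative sketch of Hadoop FileInputFormat's split-size rule, which Spark
# inherits when reading HDFS: splitSize = max(minSize, min(maxSize, blockSize)).

def hdfs_split_size(block_size: int,
                    min_size: int = 1,
                    max_size: int = 2 ** 63 - 1) -> int:
    return max(min_size, min(max_size, block_size))

def num_partitions(file_bytes: int, split_size: int) -> int:
    # One split per full block, plus one for any remainder (ceiling division).
    return -(-file_bytes // split_size)

block = 128 * 1024 * 1024  # common HDFS default: 128 MB blocks
print(num_partitions(600 * 1024 * 1024, hdfs_split_size(block)))
```

A 600 MB file on 128 MB blocks therefore produces 5 partitions: four full-block splits plus one 88 MB remainder.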

AWS Glue Job vs. EMR Spark Job Cost Comparison

Deciding on the most cost-effective option for your Spark jobs can be tricky, as AWS Glue and EMR have distinct pricing models and capabilities. Let's dive into a quick comparison to help you choose. Cost comparison considerations lead to this recommendation: it is essential to evaluate your specific use case, workload characteristics, and preferences to determine the most cost-effective option. …
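The pricing models differ in shape: Glue bills per DPU-hour, while EMR adds a per-instance surcharge on top of the underlying EC2 cost. A back-of-the-envelope comparison can be sketched as below; the dollar rates are illustrative placeholders, not current AWS prices, so check the AWS pricing pages for your region before relying on them:

```python
# Back-of-the-envelope cost comparison for a single Spark job.
# All rates below are assumed placeholder values, not real AWS pricing.

GLUE_PRICE_PER_DPU_HOUR = 0.44          # assumed Glue rate, USD
EMR_SURCHARGE_PER_INSTANCE_HOUR = 0.27  # assumed EMR fee for the instance type
EC2_PRICE_PER_INSTANCE_HOUR = 0.77      # assumed EC2 on-demand rate

def glue_cost(dpus: int, hours: float) -> float:
    return dpus * hours * GLUE_PRICE_PER_DPU_HOUR

def emr_cost(instances: int, hours: float) -> float:
    # EMR bills its surcharge plus the EC2 instances themselves.
    return instances * hours * (EMR_SURCHARGE_PER_INSTANCE_HOUR
                                + EC2_PRICE_PER_INSTANCE_HOUR)

# Example: 10 DPUs for 2 hours on Glue vs. 5 instances for 2 hours on EMR.
print(f"Glue: ${glue_cost(10, 2):.2f}  EMR: ${emr_cost(5, 2):.2f}")
```

Even a rough model like this makes the trade-off concrete: Glue's per-DPU pricing favors short, bursty jobs, while EMR's instance-based pricing can win for long-running or heavily tuned clusters.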
