Author: iomrslmy

Apache Spark Partitions – AWS S3

Reading data from AWS S3

When Apache Spark reads data from Amazon S3 (Simple Storage Service), the process of creating partitions is different from reading data from HDFS. In the case of S3, Spark does not directly align partitions with the concept of HDFS blocks, as there is no block-based storage system like HDFS in

Apache Spark Partitions – HDFS

Reading data from HDFS

When Apache Spark reads data from the Hadoop Distributed File System (HDFS), the process of creating partitions is influenced by several factors. Here's an overview of how Spark creates partitions when reading data from HDFS.

HDFS Blocks

The primary storage unit in HDFS is a block. By default, these blocks are commonly

Execute Glue Job Locally in IntelliJ Without an AWS Account

This blog post shows how to run an AWS Glue job locally in IntelliJ without an AWS account. It outlines the steps to execute Glue jobs, offering a practical solution for developers to test and debug their Glue scripts locally before deploying them to the AWS cloud. The approach enhances

AWS Glue Job vs. EMR Spark Job Cost Comparison

Deciding on the most cost-effective option for your Spark jobs can be tricky, as AWS Glue and EMR have distinct pricing models and capabilities. Let's dive into a quick comparison to help you choose.

Cost Comparison Considerations:

Recommendation: It is essential to evaluate your specific use case, workload characteristics, and preferences to determine the most
