Execute Glue Job Locally in IntelliJ Without an AWS Account


This blog post shows how to run an AWS Glue job locally in IntelliJ without an AWS account. It outlines the steps to execute Glue jobs on your own machine, giving developers a practical way to test and debug Glue scripts before deploying them to the AWS cloud. Running locally speeds up the development and debugging cycle and removes the dependency on AWS infrastructure during that phase.

Benefits

  1. Efficient Development and Testing:
    • Developers can test and iterate on Glue jobs directly within IntelliJ, enabling rapid development and testing cycles without repeated AWS deployments. This accelerates the development process.
  2. Cost Savings and Accessibility:
    • Local execution removes the need for an active AWS account during the development and testing phase, which saves costs and makes the workflow accessible to developers without AWS access.
  3. Seamless Integration with Local Development Tools:
    • Running Glue jobs locally integrates seamlessly with local development tools. Developers can use the familiar features of the IntelliJ IDE for coding, debugging, and testing, leading to a more streamlined and productive development experience.

Steps

Here is the code repository with the demo project: https://github.com/cloud-content/aws-glue-local

I followed the steps described under the section “Developing locally with Scala” on the AWS documentation page https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-libraries.html

Here are the high-level steps:

  • Install software
    • Maven
      • You might already have Maven set up locally. If not, download it from the Maven site for your OS; the AWS page above also links to a Maven download in the Glue ETL artifact repository.
    • Apache Spark distribution based on the Glue version that you want to use.
      • I used the link provided on the AWS page above. You can instead use any Apache Spark 3.x distribution from https://spark.apache.org/downloads.html, but make sure the corresponding Hadoop version is bundled with Spark.
  • Configuration
    • Maven project
      • You can start with the pom.xml from the cloud-content GitHub repo aws-glue-local and adjust it to your needs; a sketch of the key pieces follows this list.
    • SPARK_HOME
      • Set the SPARK_HOME environment variable, e.g. “export SPARK_HOME=/home/$USER/spark-3.3.0-amzn-1-bin-3.3.3-amzn-0”, or set it in the run configuration of the IntelliJ project. Change the Spark library location to match your local setup.
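
For reference, here is a minimal sketch of the relevant pom.xml pieces, based on the AWS documentation page above: the Glue ETL artifact repository plus the AWSGlueETL dependency. The version shown is only an example; pick the release matching your Glue version (for instance 4.0.0 for Glue 4.0) per the AWS docs.

<repositories>
  <repository>
    <id>aws-glue-etl-artifacts</id>
    <url>https://aws-glue-etl-artifacts.s3.amazonaws.com/release/</url>
  </repository>
</repositories>

<dependency>
  <groupId>com.amazonaws</groupId>
  <artifactId>AWSGlueETL</artifactId>
  <!-- example version; use the release matching your Glue version -->
  <version>4.0.0</version>
</dependency>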

Below is a sample Scala script that should work from local IntelliJ once the required external libraries are available.

GlueApp.scala

import com.amazonaws.services.glue.GlueContext
import com.amazonaws.services.glue.util.GlueArgParser
import com.amazonaws.services.glue.util.Job
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession
import scala.collection.JavaConverters._

object GlueApp {
  def main(sysArgs: Array[String]): Unit = {
    // Local-only Spark settings; not needed when running on AWS Glue
    System.setProperty("spark.app.name", "GlueJob")
    System.setProperty("spark.master", "local[*]")

    // Initialize the Spark and Glue contexts and get a SparkSession
    val sc: SparkContext = new SparkContext()
    val glueContext: GlueContext = new GlueContext(sc)
    val spark: SparkSession = glueContext.getSparkSession
    import spark.implicits._

    // Initialize the Glue job from the supplied command line arguments
    val args = GlueArgParser.getResolvedOptions(sysArgs, Seq("JOB_NAME").toArray)
    Job.init(args("JOB_NAME"), glueContext, args.asJava)

    // Read employees.csv into a DataFrame, then print its schema and first 10 rows
    val df = spark.read
      .format("csv")
      .option("header", "true")
      .load("src/main/resources/employees.csv")
    df.printSchema()
    df.show(10, false)

    Job.commit()
  }
}
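
The script expects a CSV file at src/main/resources/employees.csv. Any small file with a header row will do; for example (hypothetical data):

employee_id,first_name,department
1,Alice,Engineering
2,Bob,Finance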

Explanation

  • Initialize the Spark and Glue contexts.
  • Define a Spark session from the Glue context.
  • Initialize the Glue job using the supplied command line arguments.
  • Read the employees.csv file into a DataFrame using the Spark session.
  • Print the DataFrame schema.
  • Print 10 rows from the DataFrame.
  • Commit the Glue job.
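
Note that GlueArgParser.getResolvedOptions expects job parameters as --KEY value pairs, so the IntelliJ run configuration must pass at least a JOB_NAME program argument (its value is arbitrary for a local run), for example:

--JOB_NAME GlueJobLocal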

Challenges

  • Make sure you can access the required external Maven repositories to download the libraries mentioned in pom.xml. The AWSGlueETL artifact in pom.xml provides the Glue support required for local runs.
  • The org.apache.commons lang3 version supplied as part of the AWSGlueETL artifact has compatibility issues with other dependent libraries, so I had to explicitly add the lang3 dependency below.

<dependency>
  <groupId>org.apache.commons</groupId>
  <artifactId>commons-lang3</artifactId>
  <version>3.11</version>
</dependency>

  • I had to set the config properties below when running the Glue job locally; they are not needed when the script runs on AWS Glue.

System.setProperty("spark.app.name", "GlueJob")
System.setProperty("spark.master", "local[*]")
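
One way to keep a single script that runs unchanged both locally and on Glue is to set these properties conditionally. Below is a minimal sketch, assuming you define a LOCAL_RUN environment variable in the IntelliJ run configuration; the variable name is my own choice, and any marker that is absent on AWS Glue would work.

import org.apache.spark.SparkContext

object GlueAppLocalAware {
  def main(sysArgs: Array[String]): Unit = {
    // LOCAL_RUN is a hypothetical marker set only in the IntelliJ run
    // configuration; on AWS Glue it is absent, so these properties are skipped.
    if (sys.env.contains("LOCAL_RUN")) {
      System.setProperty("spark.app.name", "GlueJob")
      System.setProperty("spark.master", "local[*]")
    }
    val sc = new SparkContext()
    // ... continue with GlueContext and Job initialization as in GlueApp above
    sc.stop()
  }
}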