How to Run Spark 3 Glue Jobs Locally With Docker?

AWS Glue development requires a developer endpoint to be running at all times. (Technically it only has to run when jobs are launched; however, an endpoint cannot be stopped, and killing and re-creating it requires configuration changes, which is a major hassle.) For smaller teams and for small or hobby projects, it makes a lot of sense to develop and run Glue jobs locally, independently of AWS. This is possible with dockerized Spark, but AWS provides only limited support.

Although Spark 3 came out in early June 2020, AWS currently (as of October 2021) only provides a Docker image with Spark 2.4. Fortunately, the Spark 3.1 engine is already available in the cloud via the AWS console.

It is beyond the scope of this post to discuss whether it's worth switching from Spark 2 to 3; I just want to show you how to run Glue jobs locally with the Spark 3.1 engine.

Note that this solution is based on alruen's attempt, which didn't quite work for our purposes. (Still, a big thumbs up from here as well!)

Without further ado: you can try our pre-built image from here, or here you will find everything you need to know for a local build or customization.

Beyond Docker, you need the AWS Command Line Interface (AWS CLI). After installing it, you must authenticate with the aws configure command, which prompts for your access key ID, secret access key, default region, and output format.

You can then start the container with the following command:

docker run -it --rm --name local-glue \
  -p 8080:8080 -p 9001:9001 \
  -v $PWD/logs:/logs -v $PWD/notebook:/notebook \
  -e ZEPPELIN_LOG_DIR='/logs' -e ZEPPELIN_NOTEBOOK_DIR='/notebook' \
  -v ~/.aws:/root/.aws:ro \
  hiflylabs/local-aws-glue-v3-zeppelin:v1

Note that ~/.aws is the default location of the AWS configuration files (created by the aws configure command); the command above mounts it read-only into the container at /root/.aws.

Once the container has started, you can start coding in the Zeppelin notebook, which you can access at http://127.0.0.1:9001, or you can attach to the running container from VS Code. Although that is not part of this tutorial, you can read more about it here.
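
To verify the setup, you can run a quick sanity check in the notebook. The snippet below is a minimal sketch, assuming the image ships the awsglue Python library and boto3 (both are part of a Glue environment):

# A minimal sanity check to run in the Zeppelin notebook
from pyspark.context import SparkContext
from awsglue.context import GlueContext
import boto3

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

print(spark.version)  # should report a 3.1.x version

# confirms the mounted ~/.aws credentials are picked up
print(boto3.client("sts").get_caller_identity()["Account"])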

If all went well, you can now develop AWS Glue jobs locally on your own machine with Spark 3; you need neither the AWS console nor a developer endpoint.
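
As an illustration, here is a sketch of what a small local Glue job could look like; the s3://your-bucket/... paths are placeholders, and it assumes the credentials you configured can access them:

# Sketch of a small Glue job: CSV in, Parquet out (placeholder S3 paths)
from pyspark.context import SparkContext
from awsglue.context import GlueContext

glueContext = GlueContext(SparkContext.getOrCreate())

# Read CSV files from S3 into a Glue DynamicFrame
dyf = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://your-bucket/input/"]},
    format="csv",
    format_options={"withHeader": True},
)

# Write the result back to S3 as Parquet
glueContext.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type="s3",
    connection_options={"path": "s3://your-bucket/output/"},
    format="parquet",
)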

Zsombor Földesi - Data Engineer
