Hundreds of thousands of customers use AWS Glue, a serverless data integration service, to discover, combine, and prepare data for analytics and machine learning. When you have complex datasets and demanding Apache Spark workloads, you may experience performance bottlenecks or errors during Spark job runs. Troubleshooting these issues can be difficult and can delay getting jobs into production. Customers often use the Apache Spark Web UI, a popular debugging tool that is part of open source Apache Spark, to help fix problems and optimize job performance. AWS Glue supports Spark UI in two different ways, but you need to set it up yourself. This requires time and effort spent managing networking and EC2 instances, or trial and error with Docker containers.
Today, we are pleased to announce serverless Spark UI built into the AWS Glue console. You can now use Spark UI easily as it's a built-in component of the AWS Glue console, enabling you to access it with a single click when examining the details of any given job run. There's no infrastructure setup or teardown required. AWS Glue serverless Spark UI is a fully managed serverless offering and typically starts up in a matter of seconds. Serverless Spark UI makes it significantly faster and easier to get jobs working in production because you have ready access to low-level details for your job runs.
This post describes how the AWS Glue serverless Spark UI helps you monitor and troubleshoot your AWS Glue job runs.
Getting started with serverless Spark UI
You can access the serverless Spark UI for a given AWS Glue job run by navigating from your job's page in the AWS Glue console.
- On the AWS Glue console, choose ETL jobs.
- Choose your job.
- Choose the Runs tab.
- Select the job run you want to investigate, then choose Spark UI.
The Spark UI displays in the lower pane, as shown in the following screen capture:
Alternatively, you can get to the serverless Spark UI for a specific job run by navigating from Job run monitoring in AWS Glue.
- On the AWS Glue console, choose Job run monitoring under ETL jobs.
- Select your job run, and choose View run details.
Scroll down to the bottom to view the Spark UI for the job run.
Prerequisites
Complete the following prerequisite steps:
- Enable Spark UI event logs for your job runs. This is enabled by default on the AWS Glue console, and once enabled, Spark event log files are created during the job run and stored in your S3 bucket. The serverless Spark UI parses a Spark event log file generated in your S3 bucket to visualize detailed information for both running and completed job runs. A progress bar shows the percentage to completion, with a typical parsing time of less than a minute.
- When logs are parsed, you can use the built-in Spark UI to debug, troubleshoot, and optimize your jobs.
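If you manage jobs programmatically rather than through the console, event logging maps to the `--enable-spark-ui` and `--spark-event-logs-path` job parameters. The following is a minimal boto3 sketch of adding them to an existing job; the job name and S3 path are placeholders, and because UpdateJob overwrites the previous job definition, the sketch carries the writable fields forward:

```python
import boto3

# Placeholder values -- substitute your own job name and bucket.
JOB_NAME = "mysql-to-s3-parquet"
EVENT_LOG_PATH = "s3://amzn-s3-demo-bucket/sparkHistoryLogs/"

glue = boto3.client("glue")
job = glue.get_job(JobName=JOB_NAME)["Job"]

# Turn on Spark event logging via the two special job parameters.
default_args = dict(job.get("DefaultArguments", {}))
default_args["--enable-spark-ui"] = "true"
default_args["--spark-event-logs-path"] = EVENT_LOG_PATH

# UpdateJob replaces the whole definition, so copy the writable
# fields from the current job instead of sending only the change.
WRITABLE = {
    "Description", "LogUri", "Role", "ExecutionProperty", "Command",
    "DefaultArguments", "NonOverridableArguments", "Connections",
    "MaxRetries", "Timeout", "WorkerType", "NumberOfWorkers",
    "SecurityConfiguration", "GlueVersion", "ExecutionClass",
}
job_update = {k: v for k, v in job.items() if k in WRITABLE}
job_update["DefaultArguments"] = default_args

glue.update_job(JobName=JOB_NAME, JobUpdate=job_update)
```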
For more information about the Apache Spark UI, refer to Web UI in the Apache Spark documentation.
Monitor and Troubleshoot with Serverless Spark UI
A typical workload for AWS Glue for Apache Spark jobs is loading data from relational databases to S3-based data lakes. This section demonstrates how to monitor and troubleshoot an example job run for this workload with serverless Spark UI. The sample job reads data from a MySQL database and writes to S3 in Parquet format. The source table has approximately 70 million records.
The following screen capture shows a sample visual job authored in the AWS Glue Studio visual editor. In this example, the source MySQL table has already been registered in the AWS Glue Data Catalog. It can be registered through an AWS Glue crawler or the AWS Glue Data Catalog API. For more information, refer to Data Catalog and crawlers in AWS Glue.
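For readers who prefer scripts to the visual editor, the visual job corresponds roughly to the following PySpark sketch; the database, table, and output path names are assumptions for illustration, not the exact ones used in this post:

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Hypothetical catalog and S3 names -- substitute your own.
DATABASE = "salesdb"
TABLE = "employees"
OUTPUT_PATH = "s3://amzn-s3-demo-bucket/output/employees/"

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the source MySQL table through the Data Catalog.
source = glue_context.create_dynamic_frame.from_catalog(
    database=DATABASE,
    table_name=TABLE,
)

# Write the records to S3 in Parquet format.
glue_context.write_dynamic_frame.from_options(
    frame=source,
    connection_type="s3",
    connection_options={"path": OUTPUT_PATH},
    format="parquet",
)

job.commit()
```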
Now it's time to run the job! The first job run finished in 30 minutes and 10 seconds, as shown:
Let's use Spark UI to optimize the performance of this job run. Open the Spark UI tab on the Job runs page. When you drill down to Stages and view the Duration column, you'll notice that Stage Id=0 took 27.41 minutes to run, and that the stage had only one Spark task in the Tasks:Succeeded/Total column. That means there was no parallelism when loading data from the source MySQL database.
To optimize the data load, introduce parameters called hashfield and hashpartitions to the source table definition. For more information, refer to Reading from JDBC tables in parallel. Continuing to the Glue Data Catalog table, add two properties, hashfield=emp_no and hashpartitions=18, in Table properties. This means new job runs will parallelize the data load from the source MySQL table.
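The same property change can be made programmatically. Below is a minimal boto3 sketch, assuming the table is registered as salesdb.employees (hypothetical names); UpdateTable takes a TableInput, so the read-only fields returned by GetTable are filtered out:

```python
import boto3

# Hypothetical Data Catalog names -- replace with your own.
DATABASE = "salesdb"
TABLE = "employees"

glue = boto3.client("glue")

# Fetch the current table definition from the Data Catalog.
table = glue.get_table(DatabaseName=DATABASE, Name=TABLE)["Table"]

# Add the two JDBC parallelism properties.
params = dict(table.get("Parameters", {}))
params["hashfield"] = "emp_no"
params["hashpartitions"] = "18"

# UpdateTable expects a TableInput, which excludes read-only fields
# such as CreateTime and DatabaseName, so copy only writable ones.
WRITABLE = {
    "Name", "Description", "Owner", "Retention", "StorageDescriptor",
    "PartitionKeys", "TableType", "TargetTable",
}
table_input = {k: v for k, v in table.items() if k in WRITABLE}
table_input["Parameters"] = params

glue.update_table(DatabaseName=DATABASE, TableInput=table_input)
```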
Let's try running the same job again! This time, the job run finished in 9 minutes and 9 seconds, saving 21 minutes compared with the previous job run.
As a best practice, view the Spark UI and compare the job runs before and after the optimization. Drilling down to Completed stages, you'll notice that there was one stage with 18 tasks instead of one task.
In the first job run, AWS Glue automatically shuffled data across multiple executors before writing to the destination because there were too few tasks. In the second job run, by contrast, there was only one stage because no extra shuffling was needed, and there were 18 tasks loading data in parallel from the source MySQL database.
Considerations
Keep in mind the following considerations:
- Serverless Spark UI is supported in AWS Glue 3.0 and later
- Serverless Spark UI is available for jobs that ran after November 20, 2023, due to a change in how AWS Glue emits and stores Spark logs
- Serverless Spark UI can visualize Spark event logs up to 1 GB in size
- There is no retention limit because serverless Spark UI scans the Spark event log files in your S3 bucket
- Serverless Spark UI is not available for Spark event logs stored in an S3 bucket that can only be accessed by your VPC
Conclusion
This post described how the AWS Glue serverless Spark UI helps you monitor and troubleshoot your AWS Glue jobs. By providing instant access to the Spark UI directly within the AWS Management Console, you can now investigate the low-level details of job runs to identify and resolve issues. With the serverless Spark UI, there is no infrastructure to manage; the UI spins up automatically for each job run and tears down when no longer needed. This streamlined experience saves you time and effort compared to launching Spark UIs manually yourself.
Give the serverless Spark UI a try today. We think you'll find it invaluable for optimizing performance and quickly troubleshooting errors. We look forward to hearing your feedback as we continue improving the AWS Glue console experience.
About the authors
Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team. He is based in Tokyo, Japan. He is responsible for building software artifacts to help customers. In his spare time, he enjoys cycling on his road bike.
Alexandra Tello is a Senior Front End Engineer with the AWS Glue team in New York City. She is a passionate advocate for usability and accessibility. In her free time, she's an espresso enthusiast and enjoys building mechanical keyboards.
Matt Sampson is a Software Development Manager on the AWS Glue team. He loves working with his fellow Glue team members to build services that our customers benefit from. Outside of work, he can be found fishing and maybe singing karaoke.
Matt Su is a Senior Product Manager on the AWS Glue team. He enjoys helping customers discover insights and make better decisions using their data with AWS Analytics services. In his spare time, he enjoys skiing and gardening.