
Improve monitoring and debugging for AWS Glue jobs using new job observability metrics


For any modern data-driven company, having smooth data integration pipelines is crucial. These pipelines pull data from various sources, transform it, and load it into destination systems for analytics and reporting. When running correctly, they provide timely and trustworthy information. However, without vigilance, varying data volumes, characteristics, and application behavior can cause data pipelines to become inefficient and problematic. Performance can slow down or pipelines can become unreliable. Undetected errors result in bad data and impact downstream analysis. That's why robust monitoring and troubleshooting for data pipelines is essential across the following four areas:

  • Reliability
  • Performance
  • Throughput
  • Resource utilization

Together, these four aspects of monitoring provide end-to-end visibility and control over a data pipeline and its operations.

Today we're pleased to announce a new class of Amazon CloudWatch metrics reported with your pipelines built on top of AWS Glue for Apache Spark jobs. The new metrics provide aggregate and fine-grained insights into the health and operations of your job runs and the data being processed. In addition to providing insightful dashboards, the metrics offer classification of errors, which helps with root cause analysis of performance bottlenecks and error diagnosis. With this analysis, you can evaluate and apply the recommended fixes and best practices for architecting your jobs and pipelines. As a result, you gain the benefit of higher availability, better performance, and lower cost for your AWS Glue for Apache Spark workload.

This post demonstrates how the new enhanced metrics help you monitor and debug AWS Glue jobs.

Enable the new metrics

The new metrics can be configured through the job parameter enable-observability-metrics.

The new metrics are enabled by default on the AWS Glue Studio console. To configure the metrics on the AWS Glue Studio console, complete the following steps:

  1. On the AWS Glue console, choose ETL jobs in the navigation pane.
  2. Under Your jobs, choose your job.
  3. On the Job details tab, expand Advanced properties.
  4. Under Job observability metrics, select Enable the creation of additional observability CloudWatch metrics when this job runs.

To enable the new metrics in the AWS Glue CreateJob and StartJobRun APIs, set the following parameters in the DefaultArguments property:

  • Key: --enable-observability-metrics
  • Value: true

To enable the new metrics in the AWS Command Line Interface (AWS CLI), set the same job parameters in the --default-arguments argument.
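
For example, the following AWS CLI call enables the metrics when starting a job run. This is a minimal sketch: the job name is a placeholder, and note that StartJobRun takes the parameters in --arguments, whereas CreateJob takes them in --default-arguments.

    # Start a job run with enhanced observability metrics enabled
    # (<your-Glue-job-name> is a placeholder)
    $ aws glue start-job-run \
         --job-name <your-Glue-job-name> \
         --arguments '{"--enable-observability-metrics": "true"}'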

Use case

A typical workload for AWS Glue for Apache Spark jobs is to load data from a relational database to a data lake with SQL-based transformations. The following is a visual representation of an example job where the number of workers is 10.

When the example job ran, the workerUtilization metrics showed the following trend.

Note that workerUtilization showed values between 0.20 (20%) and 0.40 (40%) for the entire duration. This typically happens when the job capacity is over-provisioned and many Spark executors are idle, resulting in unnecessary cost. To improve resource utilization efficiency, it's a good idea to enable AWS Glue Auto Scaling. The following screenshot shows the same workerUtilization metrics graph when AWS Glue Auto Scaling is enabled for the same job.

workerUtilization showed 1.0 in the beginning because of AWS Glue Auto Scaling, and it trended between 0.75 (75%) and 1.0 (100%) based on the workload requirements.
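
If you prefer to enable Auto Scaling outside the console, it uses the same job parameter mechanism. The following is a minimal sketch, assuming your job runs on AWS Glue 3.0 or later (where the --enable-auto-scaling parameter applies); the job name is a placeholder, and with Auto Scaling the worker count acts as the maximum to scale up to.

    # Start a job run with Auto Scaling enabled; the worker count
    # is treated as the maximum number of workers
    $ aws glue start-job-run \
         --job-name <your-Glue-job-name> \
         --worker-type G.1X \
         --number-of-workers 10 \
         --arguments '{"--enable-auto-scaling": "true"}'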

Question and visualize metrics in CloudWatch

Complete the following steps to query and visualize metrics on the CloudWatch console:

  1. On the CloudWatch console, choose All metrics in the navigation pane.
  2. Under Custom namespaces, choose Glue.
  3. Choose Observability Metrics (or Observability Metrics Per Source, or Observability Metrics Per Sink).
  4. Search for and select the specific metric name, job name, job run ID, and observability group.
  5. On the Graphed metrics tab, configure your preferred statistic, period, and so on.

Query metrics using the AWS CLI

Complete the following steps for querying using the AWS CLI (for this example, we query the worker utilization metric):

  1. Create a metric definition JSON file (provide your AWS Glue job name and job run ID):
    $ cat multiplequeries.json
    [
      {
        "Id": "avgWorkerUtil_0",
        "MetricStat" : {
          "Metric" : {
            "Namespace": "Glue",
            "MetricName": "glue.driver.workerUtilization",
            "Dimensions": [
              {
                  "Name": "JobName",
                  "Value": "<your-Glue-job-name-A>"
              },
              {
                "Name": "JobRunId",
                "Value": "<your-Glue-job-run-id-A>"
              },
              {
                "Name": "Type",
                "Value": "gauge"
              },
              {
                "Name": "ObservabilityGroup",
                "Value": "resource_utilization"
              }
            ]
          },
          "Interval": 1800,
          "Stat": "Minimal",
          "Unit": "None"
        }
      },
      {
          "Id": "avgWorkerUtil_1",
          "MetricStat" : {
          "Metric" : {
            "Namespace": "Glue",
            "MetricName": "glue.driver.workerUtilization",
            "Dimensions": [
               {
                 "Name": "JobName",
                 "Value": "<your-Glue-job-name-B>"
               },
               {
                 "Name": "JobRunId",
                 "Value": "<your-Glue-job-run-id-B>"
               },
               {
                 "Name": "Type",
                 "Value": "gauge"
               },
               {
                 "Name": "ObservabilityGroup",
                 "Value": "resource_utilization"
               }
            ]
          },
          "Interval": 1800,
          "Stat": "Minimal",
          "Unit": "None"
        }
      }
    ]

  2. Run the get-metric-data command:
    $ aws cloudwatch get-metric-data --metric-data-queries file://multiplequeries.json \
         --start-time '2023-10-28T18:20' \
         --end-time '2023-10-28T19:10' \
         --region us-east-1
    {
        "MetricDataResults": [
          {
             "Id": "avgWorkerUtil_0",
             "Label": "<your label A>",
             "Timestamps": [
                   "2023-10-28T18:20:00+00:00"
             ], 
             "Values": [
                   0.06718750000000001
             ],
             "StatusCode": "Full"
          },
          {
             "Id": "avgWorkerUtil_1",
             "Label": "<your label B>",
             "Timestamps": [
                  "2023-10-28T18:20:00+00:00"
              ],
              "Values": [
                  0.5959183673469387
              ],
              "StatusCode": "Full"
           }
        ],
        "Messages": []
    }

Create a CloudWatch alarm

You can create static threshold-based alarms for the different metrics. For instructions, refer to Create a CloudWatch alarm based on a static threshold.

For example, for skewness, you can set an alarm for skewness.stage with a threshold of 1.0, and skewness.job with a threshold of 0.5. These thresholds are just a recommendation; you can adjust them based on your specific use case (for example, some jobs are expected to be skewed, and it's not an issue worth alarming on). Our recommendation is to evaluate the metric values of your job runs for some time before qualifying the anomalous values and configuring the thresholds to alarm.
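
Putting that advice into CLI form, the following creates a static threshold-based alarm on the workerUtilization metric from the earlier example, firing on sustained low utilization. This is a minimal sketch: the alarm name, the 0.2 threshold, and the placeholders for the job name, job run ID, and SNS topic are illustrative choices, not prescribed values.

    # Minimal sketch: alarm when average worker utilization stays
    # below 20%, a sign of over-provisioned capacity
    $ aws cloudwatch put-metric-alarm \
         --alarm-name glue-low-worker-utilization \
         --namespace Glue \
         --metric-name glue.driver.workerUtilization \
         --dimensions Name=JobName,Value=<your-Glue-job-name> \
                      Name=JobRunId,Value=<your-Glue-job-run-id> \
                      Name=Type,Value=gauge \
                      Name=ObservabilityGroup,Value=resource_utilization \
         --statistic Average \
         --period 1800 \
         --evaluation-periods 1 \
         --threshold 0.2 \
         --comparison-operator LessThanThreshold \
         --alarm-actions <your-sns-topic-arn>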

Other enhanced metrics

For a full list of other enhanced metrics available with AWS Glue jobs, refer to Monitoring with AWS Glue Observability metrics. These metrics allow you to capture the operational insights of your jobs, such as resource utilization (memory and disk), normalized error classes such as compilation and syntax or user and service errors, and throughput for each source or sink (records, files, partitions, and bytes read or written).
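
To see which of these metrics your jobs have actually reported, you can enumerate the Glue namespace in CloudWatch. The following is a minimal sketch; the job name is a placeholder.

    # List the observability metric names reported for a given job
    $ aws cloudwatch list-metrics \
         --namespace Glue \
         --dimensions Name=JobName,Value=<your-Glue-job-name> \
         --query 'Metrics[].MetricName' \
         --output text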

Job observability dashboards

You can further simplify observability for your AWS Glue jobs using dashboards for the insight metrics, which enable real-time monitoring using Amazon Managed Grafana and enable visualization and analysis of trends with Amazon QuickSight.
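
If you want a quick native view before setting up Grafana or QuickSight, the same metrics can also be pinned to a CloudWatch dashboard. The following is a minimal sketch for a single workerUtilization widget, reusing the dimensions from the earlier query example; the dashboard name, region, and placeholders are illustrative.

    # Minimal sketch: one-widget CloudWatch dashboard for workerUtilization
    $ aws cloudwatch put-dashboard \
         --dashboard-name glue-observability \
         --dashboard-body '{
           "widgets": [{
             "type": "metric", "x": 0, "y": 0, "width": 12, "height": 6,
             "properties": {
               "title": "Worker utilization",
               "region": "us-east-1",
               "stat": "Average",
               "period": 1800,
               "metrics": [["Glue", "glue.driver.workerUtilization",
                 "JobName", "<your-Glue-job-name>",
                 "JobRunId", "<your-Glue-job-run-id>",
                 "Type", "gauge",
                 "ObservabilityGroup", "resource_utilization"]]
             }
           }]
         }'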

Conclusion

This post demonstrated how the new enhanced CloudWatch metrics help you monitor and debug AWS Glue jobs. With these enhanced metrics, you can more easily identify and troubleshoot issues in real time. This results in AWS Glue jobs with higher uptime, faster processing, and reduced expenditures. The end benefit for you is more effective and optimized AWS Glue for Apache Spark workloads. The metrics are available in all AWS Glue supported Regions. Check it out!


About the Authors

Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team. He is responsible for building software artifacts to help customers. In his spare time, he enjoys cycling with his new road bike.

Shenoda Guirguis is a Senior Software Development Engineer on the AWS Glue team. His passion is in building scalable and distributed data infrastructure and processing systems. When he gets a chance, Shenoda enjoys reading and playing soccer.

Sean Ma is a Principal Product Manager on the AWS Glue team. He has an 18+ year track record of innovating and delivering enterprise products that unlock the power of data for users. Outside of work, Sean enjoys scuba diving and college football.

Mohit Saxena is a Senior Software Development Manager on the AWS Glue team. His team focuses on building distributed systems that give customers interactive and simple-to-use interfaces to efficiently manage and transform petabytes of data seamlessly across data lakes on Amazon S3, and databases and data warehouses on the cloud.
