In a data-driven world, clean and organized data drives decision-making. This requires integrating and processing data from different sources, a process AWS makes easier. AWS Glue is a serverless data integration service that provides visual and code-based interfaces for data integration. It leverages the computing power of the Apache Spark engine to process large volumes of data.
With AWS Glue, you can discover, prepare, manage, and integrate data from more than 70 data sources and manage it in a centralized catalog. You can visually create, run, and monitor Extract, Transform, and Load (ETL) pipelines that feed a data warehouse. AWS Glue is integrated with various AWS analytics services and Amazon S3 data lakes.
AWS Glue Studio
AWS Glue Studio is a graphical interface designed to create, run, and monitor ETL jobs in AWS Glue. You can create data pipelines by dragging and dropping data sources, and Glue Studio generates the corresponding code.
When the job script is ready, you can run it on the Apache Spark-based serverless ETL engine. The progress of the job is tracked in real time so you can quickly identify and fix issues. Glue Studio also has a library of prebuilt transformations to manipulate and enhance data for common data processing tasks.
AWS Glue DataBrew
AWS Glue DataBrew is another tool from the Glue service stack. It is a visual data preparation tool with over 250 prebuilt transformations that helps users clean and normalize data. It automates time-consuming data preparation tasks, thereby providing faster insights. DataBrew supports a wide range of data formats, such as CSV, JSON, and Parquet. It also integrates seamlessly with other AWS services like S3, Redshift, and SageMaker, which makes it a versatile tool.
AWS Glue can be used to create and orchestrate an ETL pipeline. The components involved are:
Data Catalog: The Data Catalog is a centralized repository that stores metadata. It stores information about schemas, data formats, and sources. An ETL job uses this metadata to understand the data it processes. The Data Catalog is organized into databases and tables.
Database: A container in the Data Catalog that groups the tables for your source or target data.
Table: One or more tables are created in the database. These tables contain references to the actual data.
Crawlers and Classifiers: A crawler is used to automatically detect and extract the schema information of a data store. It can crawl both file-based and table-based data stores. The crawler uses classifiers to identify the data format and determine how it should be processed. There are both built-in and custom classifiers.
Job: The job consists of an ETL script. You can use auto-generated scripts or provide your own. A script is usually written in Python or Scala.
Trigger: A trigger can start an ETL job on a schedule or on demand.
Development Endpoint: A development environment where you can develop, test, and debug your ETL job scripts.
Data Processing Unit: Data Processing Unit (DPU) is a relative measure of processing power. A single standard DPU provides 4 vCPU and 16 GB of memory and a high-memory DPU (M-DPU) provides 4 vCPU and 32 GB of memory.
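DPU allocation surfaces directly in the API when you start a job: you pick a worker type and a worker count. A minimal sketch using boto3's `start_job_run` parameters, where the job name and worker counts are hypothetical examples:

```python
# Sketch: allocating capacity when starting a Glue job via boto3.
# Job name and worker count below are placeholders, not real resources.
def job_run_request(job_name: str, num_workers: int, worker_type: str = "G.1X") -> dict:
    """Build start_job_run arguments; a G.1X worker maps to 1 DPU (4 vCPU, 16 GB)."""
    return {
        "JobName": job_name,
        "WorkerType": worker_type,      # e.g. "G.1X" (1 DPU) or "G.2X" (2 DPUs)
        "NumberOfWorkers": num_workers,
    }

request = job_run_request("nightly-etl", num_workers=10)

# With AWS credentials configured, the request would be submitted like this:
# import boto3
# glue = boto3.client("glue")
# response = glue.start_job_run(**request)
```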
What is ETL?
ETL (Extract, Transform, Load) is the process of moving data from a source to a destination, transforming it along the way.
- Extract: Read data from a data source, like an S3 bucket
- Transform: Process the data to make it suitable for analytics.
- Load: Load the transformed data into a data warehouse.
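The three stages can be sketched in plain Python, independent of any AWS service. The field names and values here are invented for illustration; in a Glue job the extract and load steps would read from and write to real data stores:

```python
import csv
import io
import json

# Extract: read raw records from a source (an in-memory CSV stands in for S3).
raw = "name,amount\nalice,10\nbob,\ncarol,25\n"
rows = list(csv.DictReader(io.StringIO(raw)))

# Transform: drop incomplete rows and cast types for analytics.
clean = [
    {"name": r["name"], "amount": int(r["amount"])}
    for r in rows
    if r["amount"]
]

# Load: serialize to the target format (JSON stands in for a warehouse table).
warehouse = json.dumps(clean)
print(warehouse)  # → [{"name": "alice", "amount": 10}, {"name": "carol", "amount": 25}]
```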
How to build an ETL pipeline
Set up AWS Glue
- Create an IAM role in the AWS Glue Console.
- Provide necessary permissions to access the source and target databases.
- Define a data catalog with databases and tables to organize the data assets.
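Defining a Data Catalog database can also be done programmatically. A minimal sketch with boto3, where the database name and description are placeholders:

```python
# Sketch: a Data Catalog database definition (names are hypothetical).
database_input = {
    "Name": "sales_db",
    "Description": "Catalog database for the sales data lake",
}

# With credentials and the IAM role from the steps above in place:
# import boto3
# glue = boto3.client("glue")
# glue.create_database(DatabaseInput=database_input)
```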
Create a crawler
- Add a new crawler from the Glue console.
- Provide the name of the source data location.
- Select the IAM role with read access to the data.
- In the output section define the data catalog with database and table names.
- Run the crawler to discover and catalog the data schema.
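The same steps can be scripted with boto3's `create_crawler` and `start_crawler`. The role ARN, bucket path, and names below are placeholders:

```python
# Sketch: a crawler definition mirroring the console steps above.
crawler_config = {
    "Name": "sales-crawler",
    "Role": "arn:aws:iam::123456789012:role/GlueCrawlerRole",  # placeholder ARN
    "DatabaseName": "sales_db",                                # output Data Catalog database
    "Targets": {"S3Targets": [{"Path": "s3://example-bucket/raw/sales/"}]},
}

# With AWS credentials configured:
# import boto3
# glue = boto3.client("glue")
# glue.create_crawler(**crawler_config)
# glue.start_crawler(Name=crawler_config["Name"])  # discovers and catalogs the schema
```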
Create an ETL job
- Select the job section of the Glue console.
- Add a new job and select the job type (Spark or Python shell).
- Specify the Data Catalog database and table names for the source and target repository.
- Add the ETL job script (auto-generated or custom-defined).
- This script implements data transformation functions like cleaning, filtering, joining, aggregation, etc.
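A transformation stage of this kind, reduced to plain Python for illustration (the record shapes are invented; a Glue script would express the same logic over DynamicFrames or Spark DataFrames):

```python
from collections import defaultdict

# Illustrative input records; in a real job these come from the source table.
orders = [
    {"region": "east", "amount": "120"},
    {"region": "east", "amount": "80"},
    {"region": "west", "amount": None},   # dirty record to be cleaned out
    {"region": "west", "amount": "50"},
]

# Cleaning: drop incomplete rows. Aggregation: total amount per region.
totals = defaultdict(int)
for order in orders:
    if order["amount"] is None:
        continue
    totals[order["region"]] += int(order["amount"])

print(dict(totals))  # → {'east': 200, 'west': 50}
```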
Test and Deploy
- Test the job with sample data.
- Monitor the logs to find any potential errors, issues, or exceptions.
- Modify the script to handle errors and exceptions and retest it.
- If successful, run the job with the full dataset.
Monitor and Optimize
- Use AWS Glue metrics and logs to track performance.
- Set up alerts for job failures or performance issues.
You can also create AWS Glue Workflows to coordinate multiple crawlers, jobs, and triggers to execute automated pipelines.
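Such a workflow can be wired up with boto3's `create_workflow` and `create_trigger`. The names and schedule below are hypothetical:

```python
# Sketch: a scheduled trigger that starts a crawler inside a workflow.
workflow_name = "sales-pipeline"
trigger = {
    "Name": "nightly-start",
    "WorkflowName": workflow_name,
    "Type": "SCHEDULED",
    "Schedule": "cron(0 2 * * ? *)",              # 02:00 UTC daily
    "Actions": [{"CrawlerName": "sales-crawler"}],  # placeholder crawler name
    "StartOnCreation": True,
}

# With AWS credentials configured:
# import boto3
# glue = boto3.client("glue")
# glue.create_workflow(Name=workflow_name)
# glue.create_trigger(**trigger)
```

Further triggers of type `CONDITIONAL` could then chain ETL jobs to run after the crawler succeeds.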
AWS Glue Pricing Structure
AWS Glue pricing is based on several factors such as the number of DPUs used, the duration of the job run, and the storage volume in the AWS Glue Data Catalog. The pricing structure for some of the key components is explained below.
ETL Jobs and interactive sessions
Here you are charged for the duration your ETL job takes to run, measured in seconds. There are no upfront costs and no charges for startup and shutdown time.
If you are using a Development Endpoint, you are charged for the time the session is active, measured in seconds.
For the Data Catalog, you are charged for the number of objects stored. The first 1 million objects are stored for free; after that, you are charged $1.00 per 100,000 objects above 1 million, per month.
Crawlers are billed at an hourly rate based on runtime and the number of DPUs used.
DataBrew interactive sessions are charged per session, calculated in 30-minute increments. For first-time users, the first 40 sessions are free.
DataBrew Jobs are charged based on the number of DataBrew nodes used to run the job. You pay only for the time taken to clean and normalize the data while running the job.
Generating data quality recommendations and scheduling and running data quality rules, within ETL or separately, incurs DPU-hour charges.
The data store where the actual data lives, such as an S3 bucket, incurs additional charges based on the storage used.
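A back-of-the-envelope cost estimate can be derived from the $0.44 per DPU-hour ETL rate in the pricing chart below. The 1-minute billing minimum assumed here applies to recent Glue versions; older versions have a longer minimum:

```python
# Sketch: estimating ETL job cost from DPUs and runtime.
# Assumes per-second billing with a 1-minute minimum (varies by Glue version).
def glue_job_cost(dpus: int, seconds: int, rate_per_dpu_hour: float = 0.44) -> float:
    billed_seconds = max(seconds, 60)
    return round(dpus * (billed_seconds / 3600) * rate_per_dpu_hour, 4)

# A 10-DPU job running for 6 minutes:
print(glue_job_cost(10, 360))  # → 0.44
```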
AWS Glue Pricing Chart
| Component | Billing Basis | Price |
| --- | --- | --- |
| ETL Jobs and Interactive Sessions | Billed on the total duration of job runs or interactive sessions, in seconds; cost is determined by the number of DPUs allocated | $0.44 per DPU-hour |
| Data Catalog Objects | Billed on the number of objects stored; objects include databases, tables, partitions, connections, classifiers, and schemas | First 1 million objects free; $1.00 per 100,000 objects above 1 million, per month |
| Crawlers | Priced by runtime, based on the number of Data Catalog tables created or updated | $0.44 per DPU-hour |
| DataBrew Interactive Sessions and Jobs | Sessions billed in 30-minute increments; jobs billed on the number of DataBrew nodes used | $1.00 per session; $0.48 per DataBrew node hour |
| Data Quality | AWS Glue DataBrew can be used for data quality tasks, and DataBrew costs apply accordingly | $1.00 per session; $0.48 per DataBrew node hour |
| Data Storage | Separate costs based on the chosen storage service, such as Amazon S3 | Based on the storage service |
| Other Charges | May include data transfer, other AWS services, and any optional add-ons or features you use | Varies |

AWS Glue offers a free tier with certain usage limits, such as a limited number of free DPUs and free crawler units each month. Usage beyond the free tier limits incurs standard charges.
Cost Optimization Strategies for AWS Glue
Right-sizing AWS resources
Choosing the right AWS resources plays an important role in cost optimization. Apache Spark splits a workload into partitions and processes them in parallel. Estimate the maximum number of messages per second a single Spark core can handle, then divide the total number of messages per second to be processed by that figure to get the number of Spark cores (and partitions) needed.
From the number of Spark cores needed, you can calculate the number of workers required, depending on the worker type and Glue version.
| AWS Glue Version | Spark Cores per Worker |
| --- | --- |
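The sizing arithmetic above can be sketched as a small calculator. The throughput figures used in the example call are illustrative, not benchmarks:

```python
import math

# Sketch: from message throughput to a worker count.
def workers_needed(msgs_per_sec: int, msgs_per_core: int, cores_per_worker: int) -> int:
    """Cores needed to keep up with the stream, rounded up to whole workers."""
    cores = math.ceil(msgs_per_sec / msgs_per_core)
    return math.ceil(cores / cores_per_worker)

# e.g. 10,000 msg/s, assuming one core handles 500 msg/s, 4 cores per worker:
print(workers_needed(10_000, 500, 4))  # → 5
```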
Set up AWS Budgets
With AWS Budgets, you can keep your resource spending in check. Use the AWS Budgets service to set a threshold for your Glue spend, and set up alarms to notify you when spending exceeds a specific limit. You can set budgets for the entire service, individual Glue jobs, or a specific DPU-hour threshold.
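A minimal sketch of such a budget using the AWS Budgets API. The account ID, amount, and the `"Service"` cost-filter value are assumptions to be checked against your account's cost data:

```python
# Sketch: a monthly cost budget scoped to Glue (values are placeholders).
budget = {
    "BudgetName": "glue-monthly",
    "BudgetLimit": {"Amount": "100", "Unit": "USD"},
    "TimeUnit": "MONTHLY",
    "BudgetType": "COST",
    "CostFilters": {"Service": ["AWS Glue"]},  # assumed filter key/value
}

# With AWS credentials configured:
# import boto3
# budgets = boto3.client("budgets")
# budgets.create_budget(AccountId="123456789012", Budget=budget)
```

Alert thresholds would be attached via notifications on the same budget.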
Use Autoscaling
The autoscaling feature is a powerful tool to reduce resource utilization. With autoscaling, Glue scales your resources up or down based on the workload, reducing idle time. This prevents wasted resources and the cost associated with them.
You can vertically scale Glue jobs that need high memory or disk space to store intermediate shuffle output. Glue's G.1X and G.2X worker types provide the higher memory and disk space suited to this type of scaling.
Process Large Datasets Incrementally
Reprocessing entire large datasets from S3 on every run results in high costs. It is a best practice to process large datasets incrementally using AWS Glue job bookmarks, push-down predicates, and exclusions.
Exclusions for S3 Storage Classes
We can set up S3 lifecycle policies to transition old data from frequently accessed “hot” storage to rarely accessed “cold” storage, such as S3 Standard-IA or S3 Glacier. Glue jobs unaware of these transitions might still try to process this cold data.
Glue’s storage class exclusions let you tell your ETL job to skip cold data entirely during processing. This helps in saving data retrieval and transfer fees.
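In a Glue job script, this exclusion is expressed through the `excludeStorageClasses` option when reading from the Data Catalog. The database and table names below are placeholders:

```python
# Sketch: skip objects in cold S3 storage classes when reading a catalog table.
exclude_options = {"excludeStorageClasses": ["GLACIER", "DEEP_ARCHIVE"]}

# Inside a Glue ETL script, where glueContext is provided by the job runtime:
# frame = glueContext.create_dynamic_frame.from_catalog(
#     database="sales_db",                 # placeholder database
#     table_name="orders",                 # placeholder table
#     additional_options=exclude_options,  # cold objects are skipped entirely
# )
```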
With its serverless architecture and comprehensive features, AWS Glue has simplified data integration and the creation of ETL/ELT workflows. Its flexible pricing model and scalability make it a great tool for Data Analytics.
In this article, we have tried to provide an overview of the AWS Glue service and its workflows. However, its pricing structure is a bit more intricate. For a more personalized walkthrough of its billing and cost optimization strategies, book a demo with our experts.