AWS Glue is a serverless data integration service that makes it simple to discover, prepare, move, and integrate data from multiple sources for analytics, machine learning (ML), and application development.
AWS Glue customers often have to meet strict security requirements, which sometimes involve locking down the network connectivity allowed to the job, or running inside a specific VPC to access another service. To run inside the VPC, the jobs need to be assigned to a single subnet, but the most suitable subnet can change over time (for instance, based on usage and availability), so you might prefer to make that decision at runtime, based on your own strategy.
Amazon Managed Workflows for Apache Airflow (Amazon MWAA) is an AWS service to run managed Airflow workflows, which allow writing custom logic to coordinate how tasks such as AWS Glue jobs run.
In this post, we show how to run an AWS Glue job as part of an Airflow workflow, with dynamic, configurable selection of the VPC subnet assigned to the job at runtime.
Solution overview
To run inside a VPC, an AWS Glue job needs to be assigned at least one connection that includes network configuration. Any connection allows specifying a VPC, subnet, and security group, but for simplicity, this post uses connections of type NETWORK, which just define the network configuration and don’t involve external systems.
If the job has a fixed subnet assigned by a single connection, the job can’t run if there is a service outage in the Availability Zone or if the subnet isn’t available for other reasons. Furthermore, each node (driver or worker) in an AWS Glue job requires an IP address assigned from the subnet. When running many large jobs concurrently, this could lead to an IP address shortage and the job running with fewer nodes than intended or not running at all.
AWS Glue extract, transform, and load (ETL) jobs allow multiple connections to be specified with multiple network configurations. However, the job will always try to use the connections’ network configuration in the order listed and pick the first one that passes the health checks and has at least two IP addresses to get the job started, which might not be the optimal option.
With this solution, you can enhance and customize that behavior by reordering the connections dynamically and defining the selection priority. If a retry is needed, the connections are reprioritized again based on the strategy, because the conditions might have changed since the last run.
As a result, it helps prevent the job from failing to run or running under capacity due to a subnet IP address shortage or even an outage, while meeting the network security and connectivity requirements.
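To make the default selection described above concrete, the following is a minimal sketch, not AWS Glue's actual implementation. It models only the rule stated in this post (take the first listed connection with at least two free IP addresses) and leaves out the health checks. The function name and the free-IP counts are hypothetical.

```python
from typing import Optional

MIN_IPS_TO_START = 2  # minimum free IPs this post says a job needs to start


def first_usable_connection(ordered_connections: list[str],
                            free_ips: dict[str, int]) -> Optional[str]:
    """Return the first connection, in listed order, with enough free IPs."""
    for name in ordered_connections:
        if free_ips.get(name, 0) >= MIN_IPS_TO_START:
            return name
    return None


# Hypothetical free-IP counts per connection (illustrative values only).
free = {"MWAA-Glue-Weblog-Subnet1": 3, "MWAA-Glue-Weblog-Subnet2": 200}

# The listed order wins, even though the second subnet has far more room.
default_pick = first_usable_connection(
    ["MWAA-Glue-Weblog-Subnet1", "MWAA-Glue-Weblog-Subnet2"], free)

# Reordering the list changes which subnet the job lands in.
reordered_pick = first_usable_connection(
    ["MWAA-Glue-Weblog-Subnet2", "MWAA-Glue-Weblog-Subnet1"], free)
```

In this toy case the default order would start the job in a subnet with only three free addresses, which is why controlling the order at runtime matters.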
The next diagram illustrates the answer structure.
Prerequisites
To follow the steps in this post, you need a user that can log in to the AWS Management Console and has permission to access Amazon MWAA, Amazon Virtual Private Cloud (Amazon VPC), and AWS Glue. The AWS Region where you choose to deploy the solution needs the capacity to create a VPC and two Elastic IP addresses. The default Regional quota for both types of resources is 5, so you might need to request an increase via the console.
You also need an AWS Identity and Access Management (IAM) role suitable to run AWS Glue jobs if you don’t have one already. For instructions, refer to Create an IAM role for AWS Glue.
Deploy an Airflow environment and VPC
First, you’ll deploy a new Airflow environment, including the creation of a new VPC with two public subnets and two private ones. This is because Amazon MWAA requires Availability Zone failure tolerance, so it needs to run on two subnets in two different Availability Zones in the Region. The public subnets are used so the NAT gateway can provide internet access for the private subnets.
Complete the following steps:
- Create an AWS CloudFormation template on your computer by copying the template from the following quick start guide into a local text file.
- On the AWS CloudFormation console, choose Stacks in the navigation pane.
- Choose Create stack with the option With new resources (standard).
- Choose Upload a template file and choose the local template file.
- Choose Next.
- Complete the setup steps, entering a name for the environment, and leave the rest of the parameters as default.
- On the last step, acknowledge that resources will be created and choose Submit.
The creation can take 20–30 minutes, until the status of the stack changes to CREATE_COMPLETE.
The resource that takes the most time is the Airflow environment. While it’s being created, you can continue with the following steps, until you are required to open the Airflow UI.
- On the stack’s Resources tab, note the IDs for the VPC and two private subnets (PrivateSubnet1 and PrivateSubnet2), to use in the next step.
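If you prefer to read those IDs programmatically rather than from the console, the lookup can be sketched as follows. This is only an illustration: the helper name physical_ids, the sample IDs, and the stack name in the comment are hypothetical, while the logical IDs PrivateSubnet1 and PrivateSubnet2 come from the step above.

```python
def physical_ids(stack_resources, logical_ids):
    """Map logical resource IDs to physical IDs from a StackResources list."""
    found = {
        res["LogicalResourceId"]: res["PhysicalResourceId"]
        for res in stack_resources
        if res["LogicalResourceId"] in logical_ids
    }
    missing = sorted(set(logical_ids) - set(found))
    if missing:
        raise KeyError(f"Resources not found in the stack: {missing}")
    return found


# Sample shaped like the "StackResources" list that Boto3 returns (made-up IDs).
sample = [
    {"LogicalResourceId": "PrivateSubnet1", "PhysicalResourceId": "subnet-0aaa"},
    {"LogicalResourceId": "PrivateSubnet2", "PhysicalResourceId": "subnet-0bbb"},
    {"LogicalResourceId": "SomethingElse", "PhysicalResourceId": "xyz"},
]
subnet_ids = physical_ids(sample, ["PrivateSubnet1", "PrivateSubnet2"])

# Against a real stack (placeholder stack name, needs AWS credentials):
#   import boto3
#   cfn = boto3.client("cloudformation")
#   resources = cfn.describe_stack_resources(
#       StackName="my-airflow-stack")["StackResources"]
#   subnet_ids = physical_ids(resources, ["PrivateSubnet1", "PrivateSubnet2"])
```

The physical IDs are only populated once the corresponding resources have been created, so this is best run after the stack reaches CREATE_COMPLETE.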
Create AWS Glue connections
The CloudFormation template deploys two private subnets. In this step, you create an AWS Glue connection to each one so AWS Glue jobs can run in them. Amazon MWAA recently added the capacity to run the Airflow cluster on shared VPCs, which reduces cost and simplifies network management. For more information, refer to Introducing shared VPC support on Amazon MWAA.
Complete the following steps to create the connections:
- On the AWS Glue console, choose Data connections in the navigation pane.
- Choose Create connection.
- Choose Network as the data source.
- Choose the VPC and private subnet (PrivateSubnet1) created by the CloudFormation stack.
- Use the default security group.
- Choose Next.
- For the connection name, enter MWAA-Glue-Weblog-Subnet1.
- Review the details and complete the creation.
- Repeat these steps using PrivateSubnet2 and name the connection MWAA-Glue-Weblog-Subnet2.
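The console steps above could also be scripted. The following sketch builds the payload for a connection of type NETWORK; the helper name and all resource IDs are placeholders, and including the Availability Zone explicitly is an assumption about what your setup needs.

```python
def network_connection_input(name, subnet_id, security_group_id, availability_zone):
    """Build the ConnectionInput payload for a Glue connection of type NETWORK."""
    return {
        "Name": name,
        "ConnectionType": "NETWORK",
        # A NETWORK connection has no external endpoint, so no properties are set.
        "ConnectionProperties": {},
        "PhysicalConnectionRequirements": {
            "SubnetId": subnet_id,
            "SecurityGroupIdList": [security_group_id],
            "AvailabilityZone": availability_zone,
        },
    }


# Placeholder IDs; substitute the values noted from the CloudFormation stack.
conn1 = network_connection_input(
    "MWAA-Glue-Weblog-Subnet1", "subnet-0aaa", "sg-0123", "us-east-1a")

# With credentials configured, the payload could be sent with Boto3:
#   import boto3
#   boto3.client("glue").create_connection(ConnectionInput=conn1)
```

Calling the helper a second time with the other subnet and the name MWAA-Glue-Weblog-Subnet2 would cover the repeat step.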
Create the AWS Glue job
Now you create the AWS Glue job that will be triggered later by the Airflow workflow. The job uses the connections created in the previous section, but instead of assigning them directly on the job, as you would normally do, in this scenario you leave the job connections list empty and let the workflow decide which one to use at runtime.
The job script in this case is not significant and is just intended to demonstrate that the job ran in one of the subnets, depending on the connection.
- On the AWS Glue console, choose ETL jobs in the navigation pane, then choose Script editor.
- Leave the default options (Spark engine and Start fresh) and choose Create script.
- Replace the placeholder script with the following Python code:
- Rename the job to AirflowBlogJob.
- On the Job details tab, for IAM Role, choose any role and enter 2 for the number of workers (just for frugality).
- Save these changes so the job is created.
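The job script itself is not reproduced in this copy of the post. As a stand-in, a script that demonstrates which subnet the driver landed in might look like the following sketch. The CIDR ranges are assumptions and must be replaced with the ranges of your two private subnets; the function names are hypothetical.

```python
import ipaddress
import socket

# Assumed CIDR ranges for the two private subnets; replace with the real ones.
SUBNET_CIDRS = {
    "PrivateSubnet1": "10.192.20.0/24",
    "PrivateSubnet2": "10.192.21.0/24",
}


def subnet_for_ip(ip, subnet_cidrs=SUBNET_CIDRS):
    """Return the name of the subnet whose CIDR contains ip, or 'unknown'."""
    address = ipaddress.ip_address(ip)
    for name, cidr in subnet_cidrs.items():
        if address in ipaddress.ip_network(cidr):
            return name
    return "unknown"


def report_driver_subnet():
    """Print the driver's IP address and the subnet it falls in."""
    ip = socket.gethostbyname(socket.gethostname())
    print(f"The driver node has been assigned the IP {ip}, "
          f"which belongs to the subnet: {subnet_for_ip(ip)}")
```

Calling report_driver_subnet() at the end of the script writes the line to the job's output log, where it can later be compared with the connection order chosen by the workflow.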
Grant AWS Glue permissions to the Airflow surroundings position
The role created for Airflow by the CloudFormation template provides the basic permissions to run workflows but not to interact with other services such as AWS Glue. In a production project, you would define your own templates with these additional permissions, but in this post, for simplicity, you add the additional permissions as an inline policy. Complete the following steps:
- On the IAM console, choose Roles in the navigation pane.
- Locate the role created by the template; its name starts with the name you assigned to the CloudFormation stack, followed by -MwaaExecutionRole-.
- On the role details page, on the Add permissions menu, choose Create inline policy.
- Switch from Visual to JSON mode and enter the following JSON in the text box. It assumes that the AWS Glue role you have follows the convention of starting with AWSGlueServiceRole. For enhanced security, you can replace the wildcard resource on the ec2:DescribeSubnets permission with the ARNs of the two private subnets from the CloudFormation stack.
- Choose Next.
- Enter GlueRelatedPermissions as the policy name and complete the creation.
In this example, we use an ETL script job; for a visual job, because it generates the script automatically on save, the Airflow role would need permission to write to the configured script path on Amazon Simple Storage Service (Amazon S3).
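The policy JSON is not reproduced in this copy of the post. The sketch below generates a policy document that is consistent with the description above, but the exact set of actions, the ARN patterns, and the service-role path are assumptions rather than the post's original policy; review them against what your workflow actually calls before using them.

```python
import json

JOB_NAME = "AirflowBlogJob"
CONNECTION_PREFIX = "MWAA-Glue-Weblog-Subnet"

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {   # Read the connections to find out which subnet each one uses.
            "Effect": "Allow",
            "Action": ["glue:GetConnection"],
            "Resource": [
                f"arn:aws:glue:*:*:connection/{CONNECTION_PREFIX}*",
                "arn:aws:glue:*:*:catalog",
            ],
        },
        {   # Update the job's connection list, start it, and poll the run.
            "Effect": "Allow",
            "Action": ["glue:GetJob", "glue:UpdateJob",
                       "glue:StartJobRun", "glue:GetJobRun"],
            "Resource": [f"arn:aws:glue:*:*:job/{JOB_NAME}"],
        },
        {   # Check how many IP addresses are free on each subnet.
            "Effect": "Allow",
            "Action": ["ec2:DescribeSubnets"],
            "Resource": "*",
        },
        {   # Updating a job involves passing its IAM role (path is assumed).
            "Effect": "Allow",
            "Action": ["iam:GetRole", "iam:PassRole"],
            "Resource": "arn:aws:iam::*:role/service-role/AWSGlueServiceRole*",
        },
    ],
}

# Print the document so it can be pasted into the JSON editor.
print(json.dumps(policy, indent=4))
```

Generating the document from the job and connection names keeps the resource ARNs in step with the names used elsewhere in the walkthrough.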
Create the Airflow DAG
An Airflow workflow is based on a Directed Acyclic Graph (DAG), which is defined by a Python file that programmatically specifies the different tasks involved and their interdependencies. Complete the following steps to create the DAG:
- Create a local file named glue_job_dag.py using a text editor.
In each of the following steps, we provide a code snippet to enter into the file and an explanation of what it does.
- The following snippet adds the required Python module imports. The modules are already installed on Airflow; if that weren’t the case, you would need to use a requirements.txt file to indicate to Airflow which modules to install. It also defines the Boto3 clients that the code will use later. By default, they use the same role and Region as Airflow, which is why you previously set up the role with the additional permissions required.
- The following snippet adds three functions to implement the connection order strategy, which defines how to reorder the given connections to establish their priority. This is just an example; you can build your custom code to implement your own logic, as per your needs. The code first checks the IPs available on each connection subnet and separates the connections that have enough IPs available to run the job at full capacity from those that could still be used because they have at least two IPs available, which is the minimum a job needs to start. If the strategy is set to random, it will randomize the order within each of the connection groups previously described and add any other connections. If the strategy is capability, it will arrange them from most IPs free to fewest.
- The following code creates the DAG itself with the run job task, which updates the job with the connection order defined by the strategy, runs it, and waits for the results. The job name, connections, and strategy come from Airflow variables, so they can be easily configured and updated. It has two retries with exponential backoff configured, so if the task fails, it will repeat the full task, including the connection selection. By then, the best choice might be another connection, or the subnet previously picked randomly might be in an Availability Zone that is currently suffering an outage, and by picking a different one, the job can recover.
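The snippets themselves are not reproduced in this copy of the post. As an illustration of the ordering strategy described above, the following is a minimal sketch, not the authors' code. The function name, the nodes_needed parameter, and the sample counts are hypothetical; the strategy values random and capability are the ones named in the text. Free-IP counts are passed in as a plain dictionary; in a real DAG they could come from the AvailableIpAddressCount field returned by the EC2 DescribeSubnets API.

```python
import random

MIN_IPS_TO_START = 2  # the post's stated minimum for a job to start


def order_connections(free_ips, nodes_needed, strategy, rng=random):
    """Return connection names ordered by priority.

    free_ips maps each connection name to the free IPs on its subnet.
    """
    full_threshold = max(nodes_needed, MIN_IPS_TO_START)
    # Connections that can run the job at full capacity.
    full = [c for c, n in free_ips.items() if n >= full_threshold]
    # Connections that can start the job, but with fewer nodes than intended.
    reduced = [c for c, n in free_ips.items()
               if MIN_IPS_TO_START <= n < full_threshold]
    # Connections that currently cannot start the job at all.
    others = [c for c, n in free_ips.items() if n < MIN_IPS_TO_START]

    if strategy == "random":
        # Shuffle inside each group so the group priority is preserved.
        rng.shuffle(full)
        rng.shuffle(reduced)
        return full + reduced + others
    if strategy == "capability":
        # Most free IPs first; ties keep their original relative order.
        return sorted(free_ips, key=free_ips.get, reverse=True)
    raise ValueError(f"Unknown strategy: {strategy}")


# Illustrative counts only: a job needing 10 nodes and three candidate subnets.
free = {"conn-a": 4, "conn-b": 120, "conn-c": 35}
by_free_ips = order_connections(free, nodes_needed=10, strategy="capability")
shuffled = order_connections(free, nodes_needed=10, strategy="random",
                             rng=random.Random(7))
```

Because the ordering is recomputed on every call, running it again inside a retried task naturally picks up any change in subnet availability since the previous attempt.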
Create the Airflow workflow
Now you create a workflow that invokes the AWS Glue job you just created:
- On the Amazon S3 console, locate the bucket created by the CloudFormation template, which has a name starting with the name of the stack, followed by -environmentbucket- (for example, myairflowstack-environmentbucket-ap1qks3nvvr4).
- Inside that bucket, create a folder called dags, and inside that folder, upload the DAG file glue_job_dag.py that you created in the previous section.
- On the Amazon MWAA console, navigate to the environment you deployed with the CloudFormation stack.
If the status is not yet Available, wait until it reaches that state. It shouldn’t take longer than 30 minutes from when you deployed the CloudFormation stack.
- Choose the environment link in the table to see the environment details.
It’s configured to pick up DAGs from the bucket and folder you used in the previous steps. Airflow will monitor that folder for changes.
- Choose Open Airflow UI to open a new tab accessing the Airflow UI, using the integrated IAM security to log you in.
If there’s any issue with the DAG file you created, it will display an error at the top of the page indicating the lines affected. In that case, review the steps and upload the file again. After a few seconds, Airflow will parse it and update or remove the error banner.
- On the Admin menu, choose Variables.
- Add three variables with the following keys and values:
  - Key glue_job_dag.glue_connections with value MWAA-Glue-Weblog-Subnet1,MWAA-Glue-Weblog-Subnet2.
  - Key glue_job_dag.glue_job_name with value AirflowBlogJob.
  - Key glue_job_dag.technique with value capability.
Run the job with a dynamic subnet assignment
Now you’re ready to run the workflow and see the strategy dynamically reordering the connections.
- On the Airflow UI, choose DAGs, and on the row glue_job_dag, choose the play icon.
- On the Browse menu, choose Task instances.
- On the instances table, scroll right to display the Log Url and choose the icon on it to open the log.
The log will update as the task runs; you can locate the line starting with “Running Glue job with the connection order:” and the previous lines showing details of the connection IPs and the category assigned. If an error occurs, you’ll see the details in this log.
- On the AWS Glue console, choose ETL jobs in the navigation pane, then choose the job AirflowBlogJob.
- On the Runs tab, choose the run instance, then the Output logs link, which will open a new tab.
- On the new tab, use the log stream link to open it.
It will display the IP that the driver was assigned and which subnet it belongs to, which should match the connection indicated by Airflow (if the log is not displayed, choose Resume so it gets updated as soon as it’s available).
- On the Airflow UI, edit the Airflow variable glue_job_dag.technique to set it to random.
- Run the DAG several times and see how the ordering changes.
Clean up
If you no longer need the deployment, delete the resources to avoid any further charges:
- Delete the Python script you uploaded, so the S3 bucket can be automatically deleted in the next step.
- Delete the CloudFormation stack.
- Delete the AWS Glue job.
- Delete the script that the job saved in Amazon S3.
- Delete the connections you created as part of this post.
Conclusion
In this post, we showed how AWS Glue and Amazon MWAA can work together to build more advanced custom workflows, while minimizing the operational and management overhead. This solution gives you more control over how your AWS Glue job runs to meet special operational, network, or security requirements.
You can deploy your own Amazon MWAA environment in multiple ways, such as with the template used in this post, on the Amazon MWAA console, or using the AWS CLI. You can also implement your own strategies to orchestrate AWS Glue jobs, based on your network architecture and requirements (for instance, to run the job closer to the data when possible).
About the authors
Michael Greenshtein is an Analytics Specialist Solutions Architect for the Public Sector.
Gonzalo Herreros is a Senior Big Data Architect on the AWS Glue team.