
Use AWS Glue ETL to perform merge, partition evolution, and schema evolution on Apache Iceberg


As enterprises collect increasing volumes of data from a variety of sources, the structure and organization of that data often needs to change over time to meet evolving analytical needs. However, altering schemas and table partitions in traditional data lakes can be a disruptive and time-consuming task, requiring renaming or recreating entire tables and reprocessing large datasets. This hampers agility and time to insight.

Schema evolution enables adding, deleting, renaming, or modifying columns without needing to rewrite existing data. This is critical for fast-moving enterprises that want to augment data structures to support new use cases. For example, an ecommerce company may add new customer demographic attributes or order status flags to enrich analytics. Apache Iceberg manages these schema changes in a backward-compatible way through its metadata table evolution architecture.

Similarly, partition evolution allows seamlessly adding, dropping, or splitting partitions. For instance, an ecommerce marketplace may initially partition order data by day. As orders accumulate and querying by day becomes inefficient, they may split to day and customer ID partitions. Table partitioning organizes big datasets for efficient query performance. Iceberg gives enterprises the flexibility to incrementally adjust partitions rather than requiring tedious rebuild procedures. New partitions can be added in a fully compatible way without downtime or having to rewrite existing data files.
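To make these two ideas concrete, the following is a minimal sketch of what such evolutions look like in Iceberg Spark SQL. The catalog, table, and column names are hypothetical, and the spark session is assumed to be configured for Iceberg as shown later in this post.

# Minimal sketch of schema and partition evolution in Iceberg Spark SQL.
# Assumes `spark` is an Iceberg-enabled session (see the AWS Glue configuration
# later in this post); catalog, table, and column names are hypothetical.

# Schema evolution: add a new customer attribute without rewriting data files
spark.sql("""
    ALTER TABLE glue_catalog.salesdb.orders
    ADD COLUMNS (customer_segment string)
""")

# Partition evolution: keep the existing daily partitioning and add a
# customer-based partition that applies only to newly written data files
spark.sql("""
    ALTER TABLE glue_catalog.salesdb.orders
    ADD PARTITION FIELD bucket(16, customer_id)
""")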

This post demonstrates how you can harness Iceberg, Amazon Simple Storage Service (Amazon S3), AWS Glue, AWS Lake Formation, and AWS Identity and Access Management (IAM) to implement a transactional data lake that supports seamless evolution. By allowing for painless schema and partition adjustments as data insights evolve, you can benefit from the future-proof flexibility needed for business success.

Overview of solution

For our example use case, a fictional large ecommerce company processes thousands of orders every day. When orders are received, updated, cancelled, shipped, delivered, or returned, the changes are made in their on-premises system, and those changes need to be replicated to an S3 data lake so that data analysts can run queries through Amazon Athena. The changes can contain schema updates as well. Due to the security requirements of different organizations, they need to manage fine-grained access control for the analysts through Lake Formation.

The following diagram illustrates the solution architecture.

The solution workflow consists of the following key steps:

  1. Ingest data from on premises into a Dropzone location using a data ingestion pipeline.
  2. Merge the data from the Dropzone location into Iceberg using AWS Glue.
  3. Query the data using Athena.

Prerequisites

For this walkthrough, you should have the following prerequisites:

Set up the infrastructure with AWS CloudFormation

To create your infrastructure with an AWS CloudFormation template, complete the following steps:

  1. Log in as an administrator to your AWS account.
  2. Open the AWS CloudFormation console.
  3. Choose Launch Stack:
  4. For Stack name, enter a name (for this post, icebergdemo1).
  5. Choose Next.
  6. Provide information for the following parameters:
    1. DatalakeUserName
    2. DatalakeUserPassword
    3. DatabaseName
    4. TableName
    5. DatabaseLFTagKey
    6. DatabaseLFTagValue
    7. TableLFTagKey
    8. TableLFTagValue
  7. Choose Next.
  8. Choose Next again.
  9. In the Review section, review the values you entered.
  10. Select I acknowledge that AWS CloudFormation might create IAM resources with custom names and choose Submit.

In a few minutes, the stack status will change to CREATE_COMPLETE.

You can go to the Outputs tab of the stack to see all the resources it has provisioned. The resources are prefixed with the stack name you provided (for this post, icebergdemo1).
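If you'd rather list those outputs programmatically than browse the console, a minimal Boto3 sketch such as the following works, assuming the stack name entered above.

import boto3

# Minimal sketch: list the outputs of the provisioned stack.
# Assumes the stack name used earlier in this post (icebergdemo1).
cfn = boto3.client('cloudformation')
stack = cfn.describe_stacks(StackName='icebergdemo1')['Stacks'][0]

print(stack['StackStatus'])  # expect CREATE_COMPLETE
for output in stack.get('Outputs', []):
    print(output['OutputKey'], '=', output['OutputValue'])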

Create an Iceberg table using Lambda and grant access using Lake Formation

To create an Iceberg table and grant access on it, complete the following steps:

  1. Navigate to the Resources tab of the CloudFormation stack icebergdemo1 and search for the logical ID named LambdaFunctionIceberg.
  2. Choose the hyperlink of the associated physical ID.

You're redirected to the Lambda function icebergdemo1-Lambda-Create-Iceberg-and-Grant-access.

  3. On the Configuration tab, choose Environment variables in the left pane.
  4. On the Code tab, you can inspect the function code.

The function uses the AWS SDK for Python (Boto3) APIs to provision the resources. It assumes the provisioned data lake admin role to perform the following tasks:

  • Grant DATA_LOCATION_ACCESS access to the data lake admin role on the registered data lake location
  • Create Lake Formation Tags (LF-Tags)
  • Create a database in the AWS Glue Data Catalog using the AWS Glue create_database API
  • Assign LF-Tags to the database
  • Grant DESCRIBE access on the database using LF-Tags to the data lake IAM user and AWS Glue ETL IAM role
  • Create an Iceberg table using the AWS Glue create_table API:
# glue_client is a Boto3 Glue client created earlier in the Lambda function
response_create_table = glue_client.create_table(
    DatabaseName="icebergdb1",
    OpenTableFormatInput={
        'IcebergInput': {
            'MetadataOperation': 'CREATE',
            'Version': '2'
        }
    },
    TableInput={
        'Name': 'ecomorders',
        'StorageDescriptor': {
            'Columns': [
                {'Name': 'ordernum', 'Type': 'int'},
                {'Name': 'sku', 'Type': 'string'},
                {'Name': 'quantity', 'Type': 'int'},
                {'Name': 'category', 'Type': 'string'},
                {'Name': 'status', 'Type': 'string'},
                {'Name': 'shipping_id', 'Type': 'string'}
            ],
            'Location': 's3://icebergdemo1-s3bucketiceberg-vthvwwblrwe8/iceberg/'
        },
        'TableType': 'EXTERNAL_TABLE'
    }
)

  • Assign LF-Tags to the table
  • Grant DESCRIBE and SELECT on the Iceberg table LF-Tags for the data lake IAM user
  • Grant ALL, DESCRIBE, SELECT, INSERT, DELETE, and ALTER access on the Iceberg table LF-Tags to the AWS Glue ETL IAM role (see the sketch after this list for how such a grant looks in Boto3)
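The last two grants use LF-Tag expressions rather than named resources. The following is a minimal sketch of such a tag-based grant; the role ARN and the tag key/value are hypothetical placeholders standing in for the CloudFormation parameters, not literal values from the Lambda function.

import boto3

# Minimal sketch of an LF-Tag-based grant. The role ARN and tag key/value
# are placeholders; the provisioned Lambda function performs the real grants.
lakeformation = boto3.client('lakeformation')

lakeformation.grant_permissions(
    Principal={'DataLakePrincipalIdentifier': 'arn:aws:iam::111122223333:role/icebergdemo1-GlueETLRole'},
    Resource={
        'LFTagPolicy': {
            'ResourceType': 'TABLE',
            'Expression': [{'TagKey': 'TableLFTagKey', 'TagValues': ['TableLFTagValue']}]
        }
    },
    Permissions=['ALL', 'DESCRIBE', 'SELECT', 'INSERT', 'DELETE', 'ALTER']
)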
  5. On the Test tab, choose Test to run the function.

When the function is complete, you will see the message "Executing function: succeeded."

Lake Formation helps you centrally manage, secure, and globally share data for analytics and machine learning. With Lake Formation, you can manage fine-grained access control for your data lake data on Amazon S3 and its metadata in the Data Catalog.

To add an Amazon S3 location as Iceberg storage in your data lake, register the location with Lake Formation. You can then use Lake Formation permissions for fine-grained access control to the Data Catalog objects that point to this location, and to the underlying data in the location.

The CloudFormation stack registered the data lake location.

Data location permissions in Lake Formation enable principals to create and alter Data Catalog resources that point to the designated registered Amazon S3 locations. Data location permissions work in addition to Lake Formation data permissions to secure information in your data lake.

Lake Formation tag-based access control (LF-TBAC) is an authorization strategy that defines permissions based on attributes. In Lake Formation, these attributes are called LF-Tags. You can attach LF-Tags to Data Catalog resources, Lake Formation principals, and table columns. You can assign and revoke permissions on Lake Formation resources using these LF-Tags. Lake Formation allows operations on those resources when the principal's tag matches the resource tag.
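As a rough illustration of LF-TBAC, the following sketch creates an LF-Tag and attaches it to a database with Boto3. The tag key and value are hypothetical; in this post, the equivalent calls are made for you by the Lambda function using the CloudFormation parameter values.

import boto3

# Minimal LF-TBAC sketch with a hypothetical tag key/value; in this post the
# provisioned Lambda function performs the equivalent calls.
lakeformation = boto3.client('lakeformation')

# Define the tag and its allowed values
lakeformation.create_lf_tag(TagKey='domain', TagValues=['ecommerce'])

# Attach the tag to the Data Catalog database
lakeformation.add_lf_tags_to_resource(
    Resource={'Database': {'Name': 'icebergdb1'}},
    LFTags=[{'TagKey': 'domain', 'TagValues': ['ecommerce']}]
)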

Verify the Iceberg table from the Lake Formation console

To verify the Iceberg table, complete the following steps:

  1. On the Lake Formation console, choose Databases in the navigation pane.
  2. Open the details page for icebergdb1.

You can see the associated database LF-Tags.

  3. Choose Tables in the navigation pane.
  4. Open the details page for ecomorders.

In the Table details section, you can observe the following:

  • Table format shows as Apache Iceberg
  • Table management shows as Managed by Data Catalog
  • Location lists the data lake location of the Iceberg table

In the LF-Tags section, you can see the associated table LF-Tags.

In the Table details section, expand Advanced table properties to view the following:

  • metadata_location points to the location of the Iceberg table's metadata file
  • table_type shows as ICEBERG

On the Schema tab, you can view the columns defined on the Iceberg table.
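You can check the same properties programmatically. The following minimal sketch reads the table definition from the Data Catalog with Boto3 and prints the Iceberg-related properties described above.

import boto3

# Minimal sketch: read the Iceberg table definition from the Data Catalog
# and print the properties shown on the Lake Formation console.
glue = boto3.client('glue')
table = glue.get_table(DatabaseName='icebergdb1', Name='ecomorders')['Table']

print(table['Parameters'].get('table_type'))          # expect ICEBERG
print(table['Parameters'].get('metadata_location'))   # current metadata file
for column in table['StorageDescriptor']['Columns']:
    print(column['Name'], column['Type'])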

Integrate Iceberg with the AWS Glue Data Catalog and Amazon S3

Iceberg tracks individual data files in a table instead of directories. When there is an explicit commit on the table, Iceberg creates data files and adds them to the table. Iceberg maintains the table state in metadata files. Any change in table state creates a new metadata file that atomically replaces the older metadata. Metadata files track the table schema, partitioning configuration, and other properties.

Iceberg requires file systems that support the operations to be compatible with object stores like Amazon S3.

Iceberg creates snapshots for the table contents. Each snapshot is a complete set of data files in the table at a point in time. Data files in snapshots are stored in one or more manifest files that contain a row for each data file in the table, its partition data, and its metrics.

The following diagram illustrates this hierarchy.

When you create an Iceberg table, it creates the metadata folder first and a metadata file in the metadata folder. The data folder is created when you load data into the Iceberg table.
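If you want to see this layout for yourself, a quick sketch like the following lists the prefixes under the table location. The bucket name comes from the create_table call shown earlier and will differ in your account.

import boto3

# Minimal sketch: list the folders Iceberg created under the table location.
# The bucket name matches the create_table call above; yours will differ.
s3 = boto3.client('s3')
response = s3.list_objects_v2(
    Bucket='icebergdemo1-s3bucketiceberg-vthvwwblrwe8',
    Prefix='iceberg/',
    Delimiter='/'
)

# Expect a metadata/ prefix right after table creation, and a data/ prefix
# once data has been loaded into the table.
for prefix in response.get('CommonPrefixes', []):
    print(prefix['Prefix'])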

Contents of the Iceberg metadata file

The Iceberg metadata file contains a lot of information, including the following:

  • format-version – Version of the Iceberg table
  • location – Amazon S3 location of the table
  • schemas – Name and data type of all columns on the table
  • partition-specs – Partitioned columns
  • sort-orders – Sort order of columns
  • properties – Table properties
  • current-snapshot-id – Current snapshot
  • refs – Table references
  • snapshots – List of snapshots, each containing the following information:
    • sequence-number – Sequence number of snapshots in chronological order (the highest number represents the current snapshot, 1 for the first snapshot)
    • snapshot-id – Snapshot ID
    • timestamp-ms – Timestamp when the snapshot was committed
    • summary – Summary of changes committed
    • manifest-list – List of manifests; this file name starts with snap-< snapshot-id >
  • schema-id – Sequence number of the schema in chronological order (the highest number represents the current schema)
  • snapshot-log – List of snapshots in chronological order
  • metadata-log – List of metadata files in chronological order

The metadata file has all the historical changes to the table's data and schema. Reviewing the contents of the metadata file directly can be a time-consuming task. Fortunately, you can query the Iceberg metadata using Athena.
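For example, Athena exposes Iceberg metadata through suffixed table names such as "$history" and "$files" (the "$partitions" table is used later in this post). The following is a minimal sketch of submitting such queries with Boto3; the workgroup name is the one this stack provisions, and the queries can just as well be run interactively in the query editor.

import boto3

# Minimal sketch: query Iceberg metadata tables through Athena.
# The workgroup is the one provisioned by the stack and is assumed to have a
# query result location configured; adjust as needed.
athena = boto3.client('athena')

for metadata_table in ['"icebergdb1"."ecomorders$history"',
                       '"icebergdb1"."ecomorders$files"']:
    athena.start_query_execution(
        QueryString=f'SELECT * FROM {metadata_table}',
        WorkGroup='icebergdemo1-workgroup'
    )
# Retrieve results with get_query_execution/get_query_results once each query succeeds.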

Iceberg framework in AWS Glue

AWS Glue 4.0 supports Iceberg tables registered with Lake Formation. In the AWS Glue ETL jobs, you need the following code to enable the Iceberg framework:

import sys
import boto3
from awsglue.context import GlueContext
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from pyspark.conf import SparkConf

aws_account_id = boto3.client('sts').get_caller_identity().get('Account')

args = getResolvedOptions(sys.argv, ['JOB_NAME', 'warehouse_path'])

# Set up configuration for AWS Glue to work with Apache Iceberg
conf = SparkConf()
conf.set("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
conf.set("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog")
conf.set("spark.sql.catalog.glue_catalog.warehouse", args['warehouse_path'])
conf.set("spark.sql.catalog.glue_catalog.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")
conf.set("spark.sql.catalog.glue_catalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
conf.set("spark.sql.catalog.glue_catalog.glue.lakeformation-enabled", "true")
conf.set("spark.sql.catalog.glue_catalog.glue.id", aws_account_id)

sc = SparkContext(conf=conf)
glueContext = GlueContext(sc)
spark = glueContext.spark_session

For read/write access to the underlying data, in addition to Lake Formation permissions, the AWS Glue IAM role that runs the AWS Glue ETL jobs was granted the lakeformation:GetDataAccess IAM permission. With this permission, Lake Formation grants the request for temporary credentials to access the data.
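A minimal sketch of attaching that permission to a role with Boto3 follows. The role and policy names are hypothetical; in this post, the CloudFormation stack already grants an equivalent permission to the Glue ETL role.

import json
import boto3

# Minimal sketch: grant lakeformation:GetDataAccess to an IAM role.
# Role and policy names are hypothetical; the stack in this post already
# attaches an equivalent permission to the Glue ETL role.
iam = boto3.client('iam')

policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "lakeformation:GetDataAccess",
            "Resource": "*"
        }
    ]
}

iam.put_role_policy(
    RoleName='my-glue-etl-role',
    PolicyName='AllowLakeFormationGetDataAccess',
    PolicyDocument=json.dumps(policy_document)
)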

The CloudFormation stack provisioned the four AWS Glue ETL jobs for you. The name of each job starts with your stack name (icebergdemo1). Complete the following steps to view the jobs:

  1. Log in as an administrator to your AWS account.
  2. On the AWS Glue console, choose ETL jobs in the navigation pane.
  3. Search for jobs with icebergdemo1 in the name.

Merge data from Dropzone into the Iceberg table

For our use case, the company ingests their ecommerce orders data daily from their on-premises location into an Amazon S3 Dropzone location. The CloudFormation stack loaded three files with sample orders for 3 days, as shown in the following figures. You see the data in the Dropzone location s3://icebergdemo1-s3bucketdropzone-kunftrcblhsk/data.

The AWS Glue ETL job icebergdemo1-GlueETL1-merge will run daily to merge the data into the Iceberg table. It has the following logic to add or update the data on Iceberg:

  • Create a Spark DataFrame from input data:
from pyspark.sql.types import IntegerType

# dropzone_dataformat and dropzone_path are job parameters resolved earlier in the job
df = spark.read.format(dropzone_dataformat).option("header", True).load(dropzone_path)
df = df.withColumn("ordernum", df["ordernum"].cast(IntegerType())) \
    .withColumn("quantity", df["quantity"].cast(IntegerType()))
df.createOrReplaceTempView("input_table")

  • For a new order, add it to the table
  • If the table has a matching order, update the status and shipping_id:
stmt_merge = f"""
    MERGE INTO glue_catalog.{database_name}.{table_name} AS t
    USING input_table AS s
    ON t.ordernum = s.ordernum
    WHEN MATCHED
            THEN UPDATE SET
                t.status = s.status,
                t.shipping_id = s.shipping_id
    WHEN NOT MATCHED THEN INSERT *
    """
spark.sql(stmt_merge)

Complete the following steps to run the AWS Glue merge job (a programmatic alternative is sketched after these steps):

  1. On the AWS Glue console, choose ETL jobs in the navigation pane.
  2. Select the ETL job icebergdemo1-GlueETL1-merge.
  3. On the Actions dropdown menu, choose Run with parameters.
  4. On the Run parameters page, go to Job parameters.
  5. For the --dropzone_path parameter, provide the S3 location of the input data (icebergdemo1-s3bucketdropzone-kunftrcblhsk/data/merge1).
  6. Run the job to add all the orders: 1001, 1002, 1003, and 1004.
  7. For the --dropzone_path parameter, change the S3 location to icebergdemo1-s3bucketdropzone-kunftrcblhsk/data/merge2.
  8. Run the job again to add orders 2001 and 2002, and update orders 1001, 1002, and 1003.
  9. For the --dropzone_path parameter, change the S3 location to icebergdemo1-s3bucketdropzone-kunftrcblhsk/data/merge3.
  10. Run the job again to add order 3001 and update orders 1001, 1003, 2001, and 2002.
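If you prefer to trigger these runs from code instead of the console, a minimal Boto3 sketch such as the following starts one run; repeat it with the merge2 and merge3 prefixes for the later runs described above.

import boto3

# Minimal sketch: start one run of the merge job with the merge1 prefix.
# Repeat with .../data/merge2 and .../data/merge3 for the later runs.
glue = boto3.client('glue')

run = glue.start_job_run(
    JobName='icebergdemo1-GlueETL1-merge',
    Arguments={'--dropzone_path': 's3://icebergdemo1-s3bucketdropzone-kunftrcblhsk/data/merge1'}
)

status = glue.get_job_run(JobName='icebergdemo1-GlueETL1-merge', RunId=run['JobRunId'])
print(status['JobRun']['JobRunState'])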

Go to the data folder of the table to see the data files written by Iceberg when you merged the data into the table using the Glue ETL job icebergdemo1-GlueETL1-merge.

Query Iceberg using Athena

The CloudFormation stack created the IAM user iceberguser1, which has read access on the Iceberg table using LF-Tags. To query Iceberg using Athena via this user, complete the following steps:

  1. Log in as iceberguser1 to the AWS Management Console.
  2. On the Athena console, choose Workgroups in the navigation pane.
  3. Locate the workgroup that CloudFormation provisioned (icebergdemo1-workgroup).
  4. Verify Athena engine version 3.

The Athena engine version 3 supports Iceberg file formats, including Parquet, ORC, and Avro.

  5. Go to the Athena query editor.
  6. Choose the workgroup icebergdemo1-workgroup on the dropdown menu.
  7. For Database, choose icebergdb1. You will see the table ecomorders.
  8. Run the following query to see the data in the Iceberg table:
    SELECT * FROM "icebergdb1"."ecomorders" ORDER BY ordernum ;

  9. Run the following query to see the table's current partitions:
    DESCRIBE icebergdb1.ecomorders ;

Partition-spec describes how the table is partitioned. In this example, there are no partitioned fields because you didn't define any partitions on the table.

Iceberg partition evolution

You might need to change your partition structure; for example, because of trend changes in common query patterns in downstream analytics. A change of partition structure for traditional tables is a significant operation that requires an entire data copy.

Iceberg makes this straightforward. When you change the partition structure on Iceberg, it doesn't require you to rewrite the data files. The old data written with earlier partitions remains unchanged. New data is written using the new specifications in a new layout. Metadata for each of the partition versions is kept separately.

Let's add the partition field category to the Iceberg table using the AWS Glue ETL job icebergdemo1-GlueETL2-partition-evolution:

ALTER TABLE glue_catalog.icebergdb1.ecomorders
    ADD PARTITION FIELD category ;

On the AWS Glue console, run the ETL job icebergdemo1-GlueETL2-partition-evolution. When the job is complete, you can query partitions using Athena.

DESCRIBE icebergdb1.ecomorders ;

SELECT * FROM "icebergdb1"."ecomorders$partitions";

You can see the partition field category, but the partition values are null. There are no new data files in the data folder, because partition evolution is a metadata operation and doesn't rewrite data files. When you add or update data, you will see the corresponding partition values populated.

Iceberg schema evolution

Iceberg supports in-place table evolution. You can evolve a table schema just like SQL. Iceberg schema updates are metadata changes, so no data files need to be rewritten to perform the schema evolution.

To explore Iceberg schema evolution, run the ETL job icebergdemo1-GlueETL3-schema-evolution via the AWS Glue console. The job runs the following SparkSQL statements:

ALTER TABLE glue_catalog.icebergdb1.ecomorders
    ADD COLUMNS (shipping_carrier string) ;

ALTER TABLE glue_catalog.icebergdb1.ecomorders
    RENAME COLUMN shipping_id TO tracking_number ;

ALTER TABLE glue_catalog.icebergdb1.ecomorders
    ALTER COLUMN ordernum TYPE bigint ;

In the Athena query editor, run the following query:

SELECT * FROM "icebergdb1"."ecomorders" ORDER BY ordernum asc ;

You can verify the schema changes to the Iceberg table:

  • A new column has been added called shipping_carrier
  • The column shipping_id has been renamed to tracking_number
  • The data type of the column ordernum has changed from int to bigint
    DESCRIBE icebergdb1.ecomorders;

Positional update

The data in tracking_number contains the shipping carrier concatenated with the tracking number. Let's assume that we want to split this data in order to keep the shipping carrier in the shipping_carrier field and the tracking number in the tracking_number field.

On the AWS Glue console, run the ETL job icebergdemo1-GlueETL4-update-table. The job runs the following SparkSQL statement to update the table:

UPDATE glue_catalog.icebergdb1.ecomorders
SET shipping_carrier = substring(tracking_number,1,3),
    tracking_number = substring(tracking_number,4,50)
WHERE tracking_number != '' ;

Query the Iceberg table to verify the updated data on tracking_number and shipping_carrier.

SELECT * FROM "icebergdb1"."ecomorders" ORDER BY ordernum ;

Now that the data has been updated on the table, you should see the partition values populated for category:

SELECT * FROM "icebergdb1"."ecomorders$partitions"
ORDER BY partition;

Clean up

To avoid incurring future charges, clean up the resources you created:

  1. On the Lambda console, open the details page for the function icebergdemo1-Lambda-Create-Iceberg-and-Grant-access.
  2. In the Environment variables section, choose the key Task_To_Perform and update the value to CLEANUP.
  3. Run the function, which drops the database, table, and their associated LF-Tags.
  4. On the AWS CloudFormation console, delete the stack icebergdemo1.
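You can also script the cleanup. The following is a minimal sketch under the assumption that Task_To_Perform is the only environment variable on the function, because update_function_configuration replaces the whole variable map.

import boto3

function_name = 'icebergdemo1-Lambda-Create-Iceberg-and-Grant-access'
lambda_client = boto3.client('lambda')

# Switch the function to cleanup mode. Note: this call replaces the entire
# environment variable map, so merge with existing variables if there are others.
lambda_client.update_function_configuration(
    FunctionName=function_name,
    Environment={'Variables': {'Task_To_Perform': 'CLEANUP'}}
)
lambda_client.get_waiter('function_updated').wait(FunctionName=function_name)

# Run the cleanup, then delete the stack.
lambda_client.invoke(FunctionName=function_name)
boto3.client('cloudformation').delete_stack(StackName='icebergdemo1')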

Conclusion

In this post, you created an Iceberg table using the AWS Glue API and used Lake Formation to control access on the Iceberg table in a transactional data lake. With AWS Glue ETL jobs, you merged data into the Iceberg table, and performed schema evolution and partition evolution without rewriting or recreating the Iceberg table. With Athena, you queried the Iceberg data and metadata.

Based on the concepts and demonstrations from this post, you can now build a transactional data lake in an enterprise using Iceberg, AWS Glue, Lake Formation, and Amazon S3.


About the Author

Satya Adimula is a Senior Data Architect at AWS based in Boston. With over two decades of experience in data and analytics, Satya helps organizations derive business insights from their data at scale.
