Data governance is the process of ensuring the integrity, availability, usability, and security of an organization's data. Because of the volume, velocity, and variety of data being ingested into data lakes, it can get difficult to develop and maintain policies and procedures to ensure data governance at scale for your data lake. Data confidentiality and data quality are the two essential themes for data governance. Data confidentiality refers to the protection and control of sensitive and private information to prevent unauthorized access, especially when dealing with personally identifiable information (PII). Data quality focuses on maintaining accurate, reliable, and consistent data across the organization. Poor data quality can lead to erroneous decisions, inefficient operations, and compromised business performance.
Companies need to ensure data confidentiality is maintained throughout the data pipeline and that high-quality data is available to consumers in a timely manner. A lot of this effort is manual, where data owners and data stewards define and apply the policies statically up front for each dataset in the lake. This gets tedious and delays data adoption across the enterprise.
In this post, we showcase how to use AWS Glue with AWS Glue Data Quality, sensitive data detection transforms, and AWS Lake Formation tag-based access control to automate data governance.
Solution overview
Let’s consider a fictional company, OkTank. OkTank has multiple ingestion pipelines that populate multiple tables in the data lake. OkTank wants to ensure the data lake is governed with data quality rules and access policies in place at all times.
Multiple personas consume data from the data lake, such as business leaders, data scientists, data analysts, and data engineers. For each set of users, a different level of governance is required. For example, business leaders need top-quality and highly accurate data, data scientists cannot see PII data and need data within an acceptable quality range for their model training, and data engineers can see all data except PII.
Currently, these requirements are hard-coded and managed manually for each set of users. OkTank wants to scale this and is looking for ways to control governance in an automated way. Primarily, they’re looking for the following features:
- When new data and tables get added to the data lake, the governance policies (data quality checks and access controls) get automatically applied to them. Unless the data is certified to be consumed, it shouldn’t be accessible to the end-users. For example, they want to ensure basic data quality checks are applied on all new tables and provide access to the data based on the data quality score.
- Because of changes in source data, the existing data profile of data lake tables may drift. It’s required to ensure the governance is met as defined. For example, the system should automatically mark columns as sensitive if sensitive data is detected in a column that was earlier marked as public and was publicly accessible to users. The system should hide the column from unauthorized users accordingly.
For the purpose of this post, the following governance policies are defined:
- No PII data should exist in tables or columns tagged as `public`.
- If a column has any PII data, the column should be marked as `sensitive`. The table should then also be marked `sensitive`.
- The following data quality rules should be applied on all tables:
  - All tables should have a minimum set of columns: `data_key`, `data_load_date`, and `data_location`.
  - `data_key` is a key column and should meet key requirements of being unique and complete.
  - `data_location` should match with locations defined in a separate reference (base) table.
  - The `data_load_date` column should be complete.
- User access to tables is controlled as per the following table.

| User Description | Can Access Sensitive Tables | Can Access Sensitive Columns | Min Data Quality Threshold Needed to Consume Data |
| --- | --- | --- | --- |
| Category 1 | Yes | Yes | 100% |
| Category 2 | Yes | No | 50% |
| Category 3 | No | No | 0% |
In this post, we use the AWS Glue Data Quality and sensitive data detection features. We also use Lake Formation tag-based access control to manage access at scale.
The following diagram illustrates the solution architecture.
The governance requirements highlighted in the earlier table are translated to the following Lake Formation LF-Tags.

| IAM User | LF-Tag: tbl_class | LF-Tag: col_class | LF-Tag: dq_tag |
| --- | --- | --- | --- |
| Category 1 | sensitive, public | sensitive, public | DQ100 |
| Category 2 | sensitive, public | public | DQ100, DQ90, DQ50_80, DQ80_90 |
| Category 3 | public | public | DQ90, DQ100, DQ_LT_50, DQ50_80, DQ80_90 |
This post uses AWS Step Functions to orchestrate the governance jobs, but you can use any other orchestration tool of choice. To simulate data ingestion, we manually place the files in an Amazon Simple Storage Service (Amazon S3) bucket. In this post, we trigger the Step Functions state machine manually for ease of understanding. In practice, you can integrate or invoke the jobs as part of a data ingestion pipeline, via event triggers like an AWS Glue crawler or Amazon S3 events, or schedule them as needed.
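As an illustration, a minimal Amazon States Language definition that runs the two governance jobs in sequence might look like the following sketch. The job names match those created later by the CloudFormation stack, but the exact state machine definition in the stack may differ:

```json
{
  "Comment": "Sketch: run the data quality job, then the LF-Tag handler job",
  "StartAt": "RunDataQualityJob",
  "States": {
    "RunDataQualityJob": {
      "Type": "Task",
      "Resource": "arn:aws:states:::glue:startJobRun.sync",
      "Parameters": { "JobName": "Data-Quality-PII-Checker_Job" },
      "Next": "RunLFTagHandlerJob"
    },
    "RunLFTagHandlerJob": {
      "Type": "Task",
      "Resource": "arn:aws:states:::glue:startJobRun.sync",
      "Parameters": { "JobName": "LF-Tag-Handler_Job" },
      "End": true
    }
  }
}
```

The `.sync` integration pattern makes each state wait for the Glue job run to finish before moving on, which is what lets the tag handler rely on the data quality results being published.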
In this post, we use an AWS Glue database named `oktank_autogov_temp` and a target table named `customer` on which we apply the governance rules. We use AWS CloudFormation to provision the resources. AWS CloudFormation enables you to model, provision, and manage AWS and third-party resources by treating infrastructure as code.
Prerequisites
Complete the following prerequisite steps:
- Identify an AWS Region in which you want to create the resources and ensure you use the same Region throughout the setup and verifications.
- Have a Lake Formation administrator role to run the CloudFormation template and grant permissions.
Sign in to the Lake Formation console and add yourself as a Lake Formation data lake administrator if you aren’t already an admin. If you are setting up Lake Formation for the first time in your Region, you can do this in the pop-up window that appears when you connect to the Lake Formation console and select the desired Region.
Otherwise, you can add data lake administrators by choosing Administrative roles and tasks in the navigation pane on the Lake Formation console and choosing Add administrators. Then select Data lake administrator, identify your users and roles, and choose Confirm.
Deploy the CloudFormation stack
Run the provided CloudFormation stack to create the solution resources.
You need to provide a unique bucket name and specify passwords for the three users reflecting three different user personas (Category 1, Category 2, and Category 3) that we use for this post.
The stack provisions an S3 bucket to store the dummy data, AWS Glue scripts, results of sensitive data detection, and Amazon Athena query results in their respective folders.
The stack copies the AWS Glue scripts into the `scripts` folder and creates two AWS Glue jobs, `Data-Quality-PII-Checker_Job` and `LF-Tag-Handler_Job`, pointing to the corresponding scripts.
The AWS Glue job `Data-Quality-PII-Checker_Job` applies the data quality rules and publishes the results. It also checks for sensitive data in the columns. In this post, we check for the `PERSON_NAME` and `EMAIL` data types. If any columns with sensitive data are detected, it persists the sensitive data detection results to the S3 bucket.
AWS Glue Data Quality uses Data Quality Definition Language (DQDL) to author the data quality rules.
The data quality requirements as defined earlier in this post are written as the following DQDL in the script:
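The rule set itself isn't reproduced in this version of the post; a reconstruction consistent with the requirements defined earlier might look like the following DQDL sketch (the reference to the `base` table assumes it is attached to the job as an additional data source named `base` — the actual script may differ):

```
Rules = [
    ColumnExists "data_key",
    ColumnExists "data_load_date",
    ColumnExists "data_location",
    IsUnique "data_key",
    IsComplete "data_key",
    IsComplete "data_load_date",
    ReferentialIntegrity "data_location" "base.data_location" = 1.0
]
```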
The following screenshot shows a sample result from the job after it runs. You can see this after you trigger the Step Functions workflow in subsequent steps. To check the results, on the AWS Glue console, choose ETL jobs and choose the job called `Data-Quality-PII-Checker_Job`. Then navigate to the Data quality tab to view the results.
The AWS Glue job `LF-Tag-Handler_Job` fetches the data quality metrics published by `Data-Quality-PII-Checker_Job`. It checks the status of the `DataQuality_PIIColumns` result. It gets the list of sensitive column names from the sensitive data detection file created by `Data-Quality-PII-Checker_Job` and tags those columns as `sensitive`. The rest of the columns are tagged as `public`. It also tags the table as `sensitive` if sensitive columns are detected. The table is marked as `public` if no sensitive columns are detected.
The job also checks the data quality score for the `DataQuality_BasicChecks` result set. It maps the data quality score into tags as shown in the following table and applies the corresponding tag on the table.
| Data Quality Score | Data Quality Tag |
| --- | --- |
| 100% | DQ100 |
| 90-100% | DQ90 |
| 80-90% | DQ80_90 |
| 50-80% | DQ50_80 |
| Less than 50% | DQ_LT_50 |
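The score-to-tag mapping above is straightforward to implement. The following is a minimal sketch (the function name is ours, not from the stack's script, and the exact boundary handling is an assumption):

```python
def dq_score_to_tag(score: float) -> str:
    """Map a data quality score (0.0-1.0) to the dq_tag LF-Tag value.

    Boundary handling (e.g. whether exactly 0.9 falls in DQ90 or DQ80_90)
    is an assumption; the actual Glue script may differ.
    """
    if score >= 1.0:
        return "DQ100"
    if score >= 0.9:
        return "DQ90"
    if score >= 0.8:
        return "DQ80_90"
    if score >= 0.5:
        return "DQ50_80"
    return "DQ_LT_50"
```

The tag buckets are deliberately coarse: Lake Formation grants are made against a fixed list of tag values, so a small, closed set of buckets keeps the grant expressions static even as scores fluctuate.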
The CloudFormation stack copies some mock data to the `data` folder and registers this location under AWS Lake Formation Data lake locations so Lake Formation can govern access on the location using the service-linked role for Lake Formation.
The `customer` subfolder contains the initial customer dataset for the table `customer`. The `base` subfolder contains the base dataset, which we use to check referential integrity as part of the data quality checks. The column `data_location` in the `customer` table should match with locations defined in this `base` table.
The stack also copies some additional mock data to the bucket under the `data-v1` folder. We use this data to simulate data quality issues.
It also creates the following resources:
- An AWS Glue database called `oktank_autogov_temp` and two tables under the database:
  - customer – This is our target table on which we will be governing access based on data quality rules and PII checks.
  - base – This is the base table that has the reference data. One of the data quality rules checks that the customer data always adheres to locations present in the base table.
- AWS Identity and Access Management (IAM) users and roles:
  - DataLakeUser_Category1 – The data lake user corresponding to the Category 1 user. This user should be able to access sensitive data but needs 100% accurate data.
  - DataLakeUser_Category2 – The data lake user corresponding to the Category 2 user. This user should not be able to access sensitive columns in the table. It needs more than 50% accurate data.
  - DataLakeUser_Category3 – The data lake user corresponding to the Category 3 user. This user should not be able to access tables containing sensitive data. Data quality can be 0%.
  - GlueServiceDQRole – The role for the data quality and sensitive data detection job.
  - GlueServiceLFTaggerRole – The role for the LF-Tags handler job for applying the tags to the table.
  - StepFunctionRole – The Step Functions role for triggering the AWS Glue jobs.
- Lake Formation LF-Tag keys and values:
  - tbl_class – `sensitive`, `public`
  - dq_tag – `DQ100`, `DQ90`, `DQ80_90`, `DQ50_80`, `DQ_LT_50`
  - col_class – `sensitive`, `public`
- A Step Functions state machine named `AutoGovMachine` that you use to trigger the runs for the AWS Glue jobs to check data quality and update the LF-Tags.
- Athena workgroups named `auto_gov_blog_workgroup_temporary_user1`, `auto_gov_blog_workgroup_temporary_user2`, and `auto_gov_blog_workgroup_temporary_user3`. These workgroups point to different Athena query result locations for each user. Each user is granted access to the corresponding query result location only. This ensures a specific user doesn't access the query results of other users. You should switch to a specific workgroup to run queries in Athena as part of the test for the specific user.
The CloudFormation stack generates the following outputs. Take note of the values of the IAM users to use in subsequent steps.
Grant permissions
After you launch the CloudFormation stack, complete the following steps:
- On the Lake Formation console, under Permissions, choose Data lake permissions in the navigation pane.
- Search for the database `oktank_autogov_temp` and table `customer`.
- If `IAMAllowedPrincipals` access is present, select it and choose Revoke.
- Choose Revoke again to revoke the permissions.
Category 1 users can access all data unless the data quality score of the table is below 100%. Therefore, we grant the user the necessary permissions.
- Under Permissions in the navigation pane, choose Data lake permissions.
- Search for database `oktank_autogov_temp` and table `customer`.
- Choose Grant.
- Select IAM users and roles and choose the value for `UserCategory1` from your CloudFormation stack output.
- Under LF-Tags or catalog resources, choose Add LF-Tag key-value pair.
- Add the following key-value pairs:
  - For the `col_class` key, add the values `public` and `sensitive`.
  - For the `tbl_class` key, add the values `public` and `sensitive`.
  - For the `dq_tag` key, add the value `DQ100`.
- For Table permissions, select Select.
- Choose Grant.
Category 2 users can’t access sensitive columns. They can access tables with a data quality score above 50%.
- Repeat the preceding steps to grant the appropriate permissions in Lake Formation to `UserCategory2`:
  - For the `col_class` key, add the value `public`.
  - For the `tbl_class` key, add the values `public` and `sensitive`.
  - For the `dq_tag` key, add the values `DQ50_80`, `DQ80_90`, `DQ90`, and `DQ100`.
- For Table permissions, select Select.
- Choose Grant.
Category 3 users can’t access tables that contain any sensitive columns. Such tables are marked as `sensitive` by the system. They can access tables with any data quality score.
- Repeat the preceding steps to grant the appropriate permissions in Lake Formation to `UserCategory3`:
  - For the `col_class` key, add the value `public`.
  - For the `tbl_class` key, add the value `public`.
  - For the `dq_tag` key, add the values `DQ_LT_50`, `DQ50_80`, `DQ80_90`, `DQ90`, and `DQ100`.
- For Table permissions, select Select.
- Choose Grant.
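The same grants can also be scripted. The following sketch builds the request payload for the Lake Formation `GrantPermissions` API for the Category 3 user; the ARN and account ID are placeholders, and you would pass the dict to `boto3.client('lakeformation').grant_permissions(**request)` to apply it:

```python
def build_lf_tag_grant(principal_arn: str, catalog_id: str) -> dict:
    """Build a GrantPermissions request granting SELECT on tables whose
    LF-Tags match public-only classes and any data quality tag."""
    return {
        "Principal": {"DataLakePrincipalArn": principal_arn},
        "Resource": {
            "LFTagPolicy": {
                "CatalogId": catalog_id,
                "ResourceType": "TABLE",
                "Expression": [
                    {"TagKey": "tbl_class", "TagValues": ["public"]},
                    {"TagKey": "col_class", "TagValues": ["public"]},
                    {"TagKey": "dq_tag", "TagValues": [
                        "DQ_LT_50", "DQ50_80", "DQ80_90", "DQ90", "DQ100"]},
                ],
            }
        },
        "Permissions": ["SELECT"],
    }

# Example payload for the Category 3 user (placeholder ARN and account ID)
request = build_lf_tag_grant(
    "arn:aws:iam::111122223333:user/DataLakeUser_Category3", "111122223333")
```

Scripting the grants this way is useful if you need to re-create the permissions in another account or Region.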
You can verify the LF-Tag permissions assigned in Lake Formation by navigating to the Data lake permissions page and searching for the Resource type `LF-Tag expression`.
Test the solution
Now we can test the workflow. We test three different use cases in this post. You’ll notice how the permissions to the tables change based on the values of LF-Tags applied to the `customer` table and the columns of the table. We use Athena to query the tables.
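The verification queries referenced in the use cases below aren't reproduced in this version of the post; a representative query is a simple select against the target table, along the lines of:

```sql
SELECT * FROM "oktank_autogov_temp"."customer" LIMIT 10;
```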
Use case 1
In this first use case, a new table was created on the lake and new data was ingested to the table. The data file `cust_feedback_v0.csv` was copied to the `data/customer` location in the S3 bucket. This simulates new data ingestion on a new table called `customer`.
Lake Formation doesn’t allow any users to access this table currently. To test this scenario, complete the following steps:
- Sign in to the Athena console with the `UserCategory1` user.
- Switch the workgroup to `auto_gov_blog_workgroup_temporary_user1` in the Athena query editor.
- Choose Acknowledge to accept the workgroup settings.
- Run the following query in the query editor:
- On the Step Functions console, run the `AutoGovMachine` state machine.
- In the Input – optional section, use the following JSON and replace the `BucketName` value with the bucket name you used for the CloudFormation stack earlier (for this post, we use `auto-gov-blog`):
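The input JSON isn't reproduced in this version of the post; based on the description above, it plausibly carries just the bucket name, along the lines of:

```json
{
  "BucketName": "auto-gov-blog"
}
```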
The state machine triggers the AWS Glue jobs to check data quality on the table and apply the corresponding LF-Tags.
- You can check the LF-Tags applied on the table and the columns. To do so, when the state machine is complete, sign in to Lake Formation with the admin role used earlier to grant permissions.
- Navigate to the table `customer` under the `oktank_autogov_temp` database and choose Edit LF-Tags to validate the tags applied on the table.
You can also validate that the columns `customer_email` and `customer_name` are tagged as sensitive for the `col_class` LF-Tag.
- To check this, choose Edit Schema for the `customer` table.
- Select the two columns and choose Edit LF-Tags.
You can check the tags on these columns.
The rest of the columns are tagged as `public`.
- Sign in to the Athena console with `UserCategory1` and run the same query again:
This time, the user is able to see the data. This is because the LF-Tag permissions we applied earlier are in effect.
- Sign in as the `UserCategory2` user to verify permissions.
- Switch to workgroup `auto_gov_blog_workgroup_temporary_user2` in Athena.
This user can access the table but can only see public columns. Therefore, the user shouldn't be able to see the `customer_email` and `customer_phone` columns because these columns contain sensitive data as identified by the system.
- Run the same query again:
- Sign in to Athena and verify the permissions for `DataLakeUser_Category3`.
- Switch to workgroup `auto_gov_blog_workgroup_temporary_user3` in Athena.
This user can't access the table because the table is marked as `sensitive` due to the presence of sensitive data columns in the table.
- Run the same query again:
Use case 2
Let’s ingest some new data into the table.
- Sign in to the Amazon S3 console with the admin role used earlier to grant permissions.
- Copy the file `cust_feedback_v1.csv` from the `data-v1` folder in the S3 bucket to the `data/customer` folder in the S3 bucket using the default options.
This new data file has data quality issues because the column `data_location` breaks referential integrity with the `base` table. This data also introduces some sensitive data in the column `comment1`. This column was earlier marked as `public` because it didn't have any sensitive data.
The following screenshot shows what the `customer` folder should look like now.
- Run the AutoGovMachine state machine again and use the same JSON as the StartExecution input you used earlier:
The job classifies the column `comment1` as `sensitive` on the `customer` table. It also updates the `dq_tag` value on the table because the data quality has changed due to the breaking referential integrity check.
You can verify the new tag values via the Lake Formation console as described earlier. The `dq_tag` value was `DQ100`. The value is changed to `DQ50_80`, reflecting the data quality score for the table.
Also, earlier the value for the `col_class` tag for the `comment1` column was `public`. The value is now changed to `sensitive` because sensitive data is detected in this column.
Category 2 users shouldn't be able to access sensitive columns in the table.
- Sign in with `UserCategory2` to Athena and rerun the earlier query:
The column `comment1` is no longer accessible to `UserCategory2`, as expected. The access permissions are handled automatically.
Also, because the data quality score goes below 100%, this new dataset is no longer accessible to the `Category1` user. This user should have access to data only when the score is 100% as per our defined rules.
- Sign in with `UserCategory1` to Athena and rerun the earlier query:
You will see the user is not able to access the table now. The access permissions are handled automatically.
Use case 3
Let’s fix the invalid data and remove the data quality issue.
- Delete the `cust_feedback_v1.csv` file from the `data/customer` Amazon S3 location.
- Copy the file `cust_feedback_v1_fixed.csv` from the `data-v1` folder in the S3 bucket to the `data/customer` S3 location. This data file fixes the data quality issues.
- Rerun the `AutoGovMachine` state machine.
When the state machine is complete, the data quality score goes up to 100% again and the tag on the table gets updated accordingly. You can verify the new tag as shown earlier via the Lake Formation console.
The `Category1` user can access the table again.
Clean up
To avoid incurring further costs, delete the CloudFormation stack to delete the resources provisioned as part of this post.
Conclusion
This post covered the AWS Glue Data Quality and sensitive data detection features and Lake Formation LF-Tag based access control. We explored how you can combine these features and use them to build a scalable, automated data governance capability for your data lake. We explored how user permissions changed when data was initially ingested to the table and when data drift was observed as part of subsequent ingestions.
For further reading, refer to the following resources:
About the Author
Shoukat Ghouse is a Senior Big Data Specialist Solutions Architect at AWS. He helps customers around the world build robust, efficient and scalable data platforms on AWS leveraging AWS analytics services like AWS Glue, AWS Lake Formation, Amazon Athena and Amazon EMR.