Data governance is the process of ensuring the integrity, availability, usability, and security of an organization's data. Because of the volume, velocity, and variety of data being ingested into data lakes, it can get difficult to develop and maintain policies and procedures to ensure data governance at scale for your data lake. Data confidentiality and data quality are the two essential themes for data governance. Data confidentiality refers to the protection and control of sensitive and private information to prevent unauthorized access, especially when dealing with personally identifiable information (PII). Data quality focuses on maintaining accurate, reliable, and consistent data across the organization. Poor data quality can lead to erroneous decisions, inefficient operations, and compromised business performance.
Companies need to ensure data confidentiality is maintained throughout the data pipeline and that high-quality data is available to consumers in a timely manner. A lot of this effort is manual, where data owners and data stewards define and apply the policies statically up front for each dataset in the lake. This gets tedious and delays data adoption across the enterprise.
In this post, we showcase how to use AWS Glue with AWS Glue Data Quality, sensitive data detection transforms, and AWS Lake Formation tag-based access control to automate data governance.
Solution overview
Let’s consider a fictional company, OkTank. OkTank has multiple ingestion pipelines that populate multiple tables in the data lake. OkTank wants to ensure the data lake is governed with data quality rules and access policies in place at all times.
Multiple personas consume data from the data lake, such as business leaders, data scientists, data analysts, and data engineers. For each set of users, a different level of governance is required. For example, business leaders need top-quality and highly accurate data, data scientists cannot see PII data and need data within an acceptable quality range for their model training, and data engineers can see all data except PII.
Currently, these requirements are hard-coded and managed manually for each set of users. OkTank wants to scale this and is looking for ways to control governance in an automated way. Primarily, they’re looking for the following features:
- When new data and tables get added to the data lake, the governance policies (data quality checks and access controls) get automatically applied to them. Unless the data is certified to be consumed, it shouldn’t be accessible to the end-users. For example, they want to ensure basic data quality checks are applied on all new tables and provide access to the data based on the data quality score.
- Because of changes in source data, the existing data profile of data lake tables may drift. It’s required to ensure the governance is met as defined. For example, the system should automatically mark columns as sensitive if sensitive data is detected in a column that was earlier marked as public and was publicly accessible to users. The system should hide the column from unauthorized users accordingly.
For the purpose of this post, the following governance policies are defined:
- No PII data should exist in tables or columns tagged as `public`.
- If a column has any PII data, the column should be marked as `sensitive`. The table should then also be marked `sensitive`.
- The following data quality rules should be applied on all tables:
  - All tables should have a minimum set of columns: `data_key`, `data_load_date`, and `data_location`.
  - `data_key` is a key column and should meet key requirements of being unique and complete.
  - `data_location` should match with locations defined in a separate reference (base) table.
  - The `data_load_date` column should be complete.
- User access to tables is controlled as per the following table.

| User Description | Can Access Sensitive Tables | Can Access Sensitive Columns | Min Data Quality Threshold Needed to Consume Data |
| --- | --- | --- | --- |
| Category 1 | Yes | Yes | 100% |
| Category 2 | Yes | No | 50% |
| Category 3 | No | No | 0% |
In this post, we use the AWS Glue Data Quality and sensitive data detection features. We also use Lake Formation tag-based access control to manage access at scale.
The following diagram illustrates the solution architecture.
The governance requirements highlighted in the earlier table are translated to the following Lake Formation LF-Tags.

| IAM User | LF-Tag: tbl_class | LF-Tag: col_class | LF-Tag: dq_tag |
| --- | --- | --- | --- |
| Category 1 | sensitive, public | sensitive, public | DQ100 |
| Category 2 | sensitive, public | public | DQ100, DQ90, DQ50_80, DQ80_90 |
| Category 3 | public | public | DQ90, DQ100, DQ_LT_50, DQ50_80, DQ80_90 |
This post uses AWS Step Functions to orchestrate the governance jobs, but you can use any other orchestration tool of choice. To simulate data ingestion, we manually place the files in an Amazon Simple Storage Service (Amazon S3) bucket. In this post, we trigger the Step Functions state machine manually for ease of understanding. In practice, you can integrate or invoke the jobs as part of a data ingestion pipeline, via event triggers like an AWS Glue crawler or Amazon S3 events, or schedule them as needed.
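As an illustration, a minimal Amazon States Language definition that runs the two governance jobs in sequence might look like the following sketch. The job names match those created later by the CloudFormation stack, but the exact state machine definition in the stack may differ:

```json
{
  "Comment": "Sketch: run the data quality job, then the LF-Tag handler job",
  "StartAt": "RunDataQualityJob",
  "States": {
    "RunDataQualityJob": {
      "Type": "Task",
      "Resource": "arn:aws:states:::glue:startJobRun.sync",
      "Parameters": { "JobName": "Data-Quality-PII-Checker_Job" },
      "Next": "RunLFTagHandlerJob"
    },
    "RunLFTagHandlerJob": {
      "Type": "Task",
      "Resource": "arn:aws:states:::glue:startJobRun.sync",
      "Parameters": { "JobName": "LF-Tag-Handler_Job" },
      "End": true
    }
  }
}
```

The `.sync` integration pattern makes each state wait for the Glue job run to finish before moving on, which is what lets the tag handler rely on the data quality results being published.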
In this post, we use an AWS Glue database named `oktank_autogov_temp` and a target table named `customer` on which we apply the governance rules. We use AWS CloudFormation to provision the resources. AWS CloudFormation enables you to model, provision, and manage AWS and third-party resources by treating infrastructure as code.
Prerequisites
Complete the following prerequisite steps:
- Identify an AWS Region in which you want to create the resources and ensure you use the same Region throughout the setup and verifications.
- Have a Lake Formation administrator role to run the CloudFormation template and grant permissions.
Sign in to the Lake Formation console and add yourself as a Lake Formation data lake administrator if you aren’t already an admin. If you are setting up Lake Formation for the first time in your Region, you can do this in the pop-up window that appears when you connect to the Lake Formation console and select the desired Region.
Otherwise, you can add data lake administrators by choosing Administrative roles and tasks in the navigation pane on the Lake Formation console and choosing Add administrators. Then select Data lake administrator, identify your users and roles, and choose Confirm.
Deploy the CloudFormation stack
Run the provided CloudFormation stack to create the solution resources.
You need to provide a unique bucket name and specify passwords for the three users reflecting three different user personas (Category 1, Category 2, and Category 3) that we use for this post.
The stack provisions an S3 bucket to store the dummy data, AWS Glue scripts, results of sensitive data detection, and Amazon Athena query results in their respective folders.
The stack copies the AWS Glue scripts into the `scripts` folder and creates two AWS Glue jobs, `Data-Quality-PII-Checker_Job` and `LF-Tag-Handler_Job`, pointing to the corresponding scripts.
The AWS Glue job `Data-Quality-PII-Checker_Job` applies the data quality rules and publishes the results. It also checks for sensitive data in the columns. In this post, we check for the `PERSON_NAME` and `EMAIL` data types. If any columns with sensitive data are detected, it persists the sensitive data detection results to the S3 bucket.
AWS Glue Data Quality uses Data Quality Definition Language (DQDL) to author the data quality rules.
The data quality requirements as defined earlier in this post are written as the following DQDL in the script:
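The rule set itself isn't reproduced in this version of the post; a reconstruction consistent with the requirements defined earlier might look like the following DQDL sketch (the reference to the `base` table assumes it is attached to the job as an additional data source named `base` — the actual script may differ):

```
Rules = [
    ColumnExists "data_key",
    ColumnExists "data_load_date",
    ColumnExists "data_location",
    IsUnique "data_key",
    IsComplete "data_key",
    IsComplete "data_load_date",
    ReferentialIntegrity "data_location" "base.data_location" = 1.0
]
```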
The following screenshot shows a sample result from the job after it runs. You can see this after you trigger the Step Functions workflow in subsequent steps. To check the results, on the AWS Glue console, choose ETL jobs and choose the job called `Data-Quality-PII-Checker_Job`. Then navigate to the Data quality tab to view the results.
The AWS Glue job `LF-Tag-Handler_Job` fetches the data quality metrics published by `Data-Quality-PII-Checker_Job`. It checks the status of the `DataQuality_PIIColumns` result. It gets the list of sensitive column names from the sensitive data detection file created by `Data-Quality-PII-Checker_Job` and tags those columns as `sensitive`. The rest of the columns are tagged as `public`. It also tags the table as `sensitive` if sensitive columns are detected. The table is marked as `public` if no sensitive columns are detected.
The job also checks the data quality score for the `DataQuality_BasicChecks` result set. It maps the data quality score into tags as shown in the following table and applies the corresponding tag on the table.
| Data Quality Score | Data Quality Tag |
| --- | --- |
| 100% | DQ100 |
| 90-100% | DQ90 |
| 80-90% | DQ80_90 |
| 50-80% | DQ50_80 |
| Less than 50% | DQ_LT_50 |
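The score-to-tag mapping above is straightforward to implement. The following is a minimal sketch (the function name is ours, not from the stack's script, and the exact boundary handling is an assumption):

```python
def dq_score_to_tag(score: float) -> str:
    """Map a data quality score (0.0-1.0) to the dq_tag LF-Tag value.

    Boundary handling (e.g. whether exactly 0.9 falls in DQ90 or DQ80_90)
    is an assumption; the actual Glue script may differ.
    """
    if score >= 1.0:
        return "DQ100"
    if score >= 0.9:
        return "DQ90"
    if score >= 0.8:
        return "DQ80_90"
    if score >= 0.5:
        return "DQ50_80"
    return "DQ_LT_50"
```

The tag buckets are deliberately coarse: Lake Formation grants are made against a fixed list of tag values, so a small, closed set of buckets keeps the grant expressions static even as scores fluctuate.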
The CloudFormation stack copies some mock data to the `data` folder and registers this location under AWS Lake Formation Data lake locations so Lake Formation can govern access on the location using the service-linked role for Lake Formation.
The `customer` subfolder contains the initial customer dataset for the table `customer`. The `base` subfolder contains the base dataset, which we use to check referential integrity as part of the data quality checks. The column `data_location` in the `customer` table should match with locations defined in this `base` table.
The stack also copies some additional mock data to the bucket under the `data-v1` folder. We use this data to simulate data quality issues.
It also creates the following resources:
- An AWS Glue database called `oktank_autogov_temp` and two tables under the database:
  - customer – This is our target table on which we will be governing access based on data quality rules and PII checks.
  - base – This is the base table that has the reference data. One of the data quality rules checks that the customer data always adheres to locations present in the base table.
- AWS Identity and Access Management (IAM) users and roles:
  - DataLakeUser_Category1 – The data lake user corresponding to the Category 1 user. This user should be able to access sensitive data but needs 100% accurate data.
  - DataLakeUser_Category2 – The data lake user corresponding to the Category 2 user. This user should not be able to access sensitive columns in the table. It needs more than 50% accurate data.
  - DataLakeUser_Category3 – The data lake user corresponding to the Category 3 user. This user should not be able to access tables containing sensitive data. Data quality can be 0%.
  - GlueServiceDQRole – The role for the data quality and sensitive data detection job.
  - GlueServiceLFTaggerRole – The role for the LF-Tags handler job for applying the tags to the table.
  - StepFunctionRole – The Step Functions role for triggering the AWS Glue jobs.
- Lake Formation LF-Tag keys and values:
  - tbl_class – `sensitive`, `public`
  - dq_tag – `DQ100`, `DQ90`, `DQ80_90`, `DQ50_80`, `DQ_LT_50`
  - col_class – `sensitive`, `public`
- A Step Functions state machine named `AutoGovMachine` that you use to trigger the runs for the AWS Glue jobs to check data quality and update the LF-Tags.
- Athena workgroups named `auto_gov_blog_workgroup_temporary_user1`, `auto_gov_blog_workgroup_temporary_user2`, and `auto_gov_blog_workgroup_temporary_user3`. These workgroups point to different Athena query result locations for each user. Each user is granted access to the corresponding query result location only. This ensures a specific user doesn't access the query results of other users. You should switch to a specific workgroup to run queries in Athena as part of the test for the specific user.
The CloudFormation stack generates the following outputs. Take note of the values of the IAM users to use in subsequent steps.
Grant permissions
After you launch the CloudFormation stack, complete the following steps:
- On the Lake Formation console, under Permissions, choose Data lake permissions in the navigation pane.
- Search for the database `oktank_autogov_temp` and table `customer`.
- If `IAMAllowedPrincipals` access is present, select it and choose Revoke.
- Choose Revoke again to revoke the permissions.
Category 1 users can access all data unless the data quality score of the table is below 100%. Therefore, we grant the user the necessary permissions.
- Under Permissions in the navigation pane, choose Data lake permissions.
- Search for database `oktank_autogov_temp` and table `customer`.
- Choose Grant.
- Select IAM users and roles and choose the value for `UserCategory1` from your CloudFormation stack output.
- Under LF-Tags or catalog resources, choose Add LF-Tag key-value pair.
- Add the following key-value pairs:
  - For the `col_class` key, add the values `public` and `sensitive`.
  - For the `tbl_class` key, add the values `public` and `sensitive`.
  - For the `dq_tag` key, add the value `DQ100`.
- For Table permissions, select Select.
- Choose Grant.
Category 2 users can’t access sensitive columns. They can access tables with a data quality score above 50%.
- Repeat the preceding steps to grant the appropriate permissions in Lake Formation to `UserCategory2`:
  - For the `col_class` key, add the value `public`.
  - For the `tbl_class` key, add the values `public` and `sensitive`.
  - For the `dq_tag` key, add the values `DQ50_80`, `DQ80_90`, `DQ90`, and `DQ100`.
- For Table permissions, select Select.
- Choose Grant.
Category 3 users can’t access tables that contain any sensitive columns. Such tables are marked as `sensitive` by the system. They can access tables with any data quality score.
- Repeat the preceding steps to grant the appropriate permissions in Lake Formation to `UserCategory3`:
  - For the `col_class` key, add the value `public`.
  - For the `tbl_class` key, add the value `public`.
  - For the `dq_tag` key, add the values `DQ_LT_50`, `DQ50_80`, `DQ80_90`, `DQ90`, and `DQ100`.
- For Table permissions, select Select.
- Choose Grant.
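The same grants can also be scripted. The following sketch builds the request payload for the Lake Formation `GrantPermissions` API for the Category 3 user; the ARN and account ID are placeholders, and you would pass the dict to `boto3.client('lakeformation').grant_permissions(**request)` to apply it:

```python
def build_lf_tag_grant(principal_arn: str, catalog_id: str) -> dict:
    """Build a GrantPermissions request granting SELECT on tables whose
    LF-Tags match public-only classes and any data quality tag."""
    return {
        "Principal": {"DataLakePrincipalArn": principal_arn},
        "Resource": {
            "LFTagPolicy": {
                "CatalogId": catalog_id,
                "ResourceType": "TABLE",
                "Expression": [
                    {"TagKey": "tbl_class", "TagValues": ["public"]},
                    {"TagKey": "col_class", "TagValues": ["public"]},
                    {"TagKey": "dq_tag", "TagValues": [
                        "DQ_LT_50", "DQ50_80", "DQ80_90", "DQ90", "DQ100"]},
                ],
            }
        },
        "Permissions": ["SELECT"],
    }

# Example payload for the Category 3 user (placeholder ARN and account ID)
request = build_lf_tag_grant(
    "arn:aws:iam::111122223333:user/DataLakeUser_Category3", "111122223333")
```

Scripting the grants this way is useful if you need to re-create the permissions in another account or Region.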
You can verify the LF-Tag permissions assigned in Lake Formation by navigating to the Data lake permissions page and searching for the Resource type `LF-Tag expression`.
Test the solution
Now we can test the workflow. We test three different use cases in this post. You’ll notice how the permissions to the tables change based on the values of LF-Tags applied to the `customer` table and the columns of the table. We use Athena to query the tables.
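The verification queries referenced in the use cases below aren't reproduced in this version of the post; a representative query is a simple select against the target table, along the lines of:

```sql
SELECT * FROM "oktank_autogov_temp"."customer" LIMIT 10;
```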
Use case 1
In this first use case, a new table was created on the lake and new data was ingested to the table. The data file `cust_feedback_v0.csv` was copied to the `data/customer` location in the S3 bucket. This simulates new data ingestion on a new table called `customer`.
Lake Formation doesn’t allow any users to access this table currently. To test this scenario, complete the following steps:
- Sign in to the Athena console with the `UserCategory1` user.
- Switch the workgroup to `auto_gov_blog_workgroup_temporary_user1` in the Athena query editor.
- Choose Acknowledge to accept the workgroup settings.
- Run the following query in the query editor:
- On the Step Functions console, run the `AutoGovMachine` state machine.
- In the Input – optional section, use the following JSON and replace the `BucketName` value with the bucket name you used for the CloudFormation stack earlier (for this post, we use `auto-gov-blog`):
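The input JSON isn't reproduced in this version of the post; based on the description above, it plausibly carries just the bucket name, along the lines of:

```json
{
  "BucketName": "auto-gov-blog"
}
```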
The state machine triggers the AWS Glue jobs to check data quality on the table and apply the corresponding LF-Tags.
- You can check the LF-Tags applied on the table and the columns. To do so, when the state machine is complete, sign in to Lake Formation with the admin role used earlier to grant permissions.
- Navigate to the table `customer` under the `oktank_autogov_temp` database and choose Edit LF-Tags to validate the tags applied on the table.
You can also validate that the columns `customer_email` and `customer_name` are tagged as sensitive for the `col_class` LF-Tag.
- To check this, choose Edit Schema for the `customer` table.
- Select the two columns and choose Edit LF-Tags.
You can check the tags on these columns.
The rest of the columns are tagged as `public`.
- Sign in to the Athena console with `UserCategory1` and run the same query again:
This time, the user is able to see the data. This is because the LF-Tag permissions we applied earlier are in effect.
- Sign in as the `UserCategory2` user to verify permissions.
- Switch to workgroup `auto_gov_blog_workgroup_temporary_user2` in Athena.
This user can access the table but can only see public columns. Therefore, the user shouldn't be able to see the `customer_email` and `customer_phone` columns because these columns contain sensitive data as identified by the system.
- Run the same query again:
- Sign in to Athena and verify the permissions for `DataLakeUser_Category3`.
- Switch to workgroup `auto_gov_blog_workgroup_temporary_user3` in Athena.
This user can't access the table because the table is marked as `sensitive` due to the presence of sensitive data columns in the table.
- Run the same query again:
Use case 2
Let’s ingest some new data into the table.
- Sign in to the Amazon S3 console with the admin role used earlier to grant permissions.
- Copy the file `cust_feedback_v1.csv` from the `data-v1` folder in the S3 bucket to the `data/customer` folder in the S3 bucket using the default options.
This new data file has data quality issues because the column `data_location` breaks referential integrity with the `base` table. This data also introduces some sensitive data in the column `comment1`. This column was earlier marked as `public` because it didn't have any sensitive data.
The following screenshot shows what the `customer` folder should look like now.
- Run the AutoGovMachine state machine again and use the same JSON as the StartExecution input you used earlier:
The job classifies the column `comment1` as `sensitive` on the `customer` table. It also updates the `dq_tag` value on the table because the data quality has changed due to the breaking referential integrity check.
You can verify the new tag values via the Lake Formation console as described earlier. The `dq_tag` value was `DQ100`. The value is changed to `DQ50_80`, reflecting the data quality score for the table.
Also, earlier the value for the `col_class` tag for the `comment1` column was `public`. The value is now changed to `sensitive` because sensitive data is detected in this column.
Category 2 users shouldn't be able to access sensitive columns in the table.
- Sign in with `UserCategory2` to Athena and rerun the earlier query:
The column `comment1` is no longer accessible to `UserCategory2`, as expected. The access permissions are handled automatically.
Also, because the data quality score goes below 100%, this new dataset is no longer accessible to the `Category1` user. This user should have access to data only when the score is 100% as per our defined rules.
- Sign in with `UserCategory1` to Athena and rerun the earlier query:
You will see the user is not able to access the table now. The access permissions are handled automatically.
Use case 3
Let’s fix the invalid data and remove the data quality issue.
- Delete the `cust_feedback_v1.csv` file from the `data/customer` Amazon S3 location.
- Copy the file `cust_feedback_v1_fixed.csv` from the `data-v1` folder in the S3 bucket to the `data/customer` S3 location. This data file fixes the data quality issues.
- Rerun the `AutoGovMachine` state machine.
When the state machine is complete, the data quality score goes up to 100% again and the tag on the table gets updated accordingly. You can verify the new tag as shown earlier via the Lake Formation console.
The `Category1` user can access the table again.
Clean up
To avoid incurring further costs, delete the CloudFormation stack to delete the resources provisioned as part of this post.
Conclusion
This post covered the AWS Glue Data Quality and sensitive data detection features and Lake Formation LF-Tag based access control. We explored how you can combine these features and use them to build a scalable, automated data governance capability for your data lake. We explored how user permissions changed when data was initially ingested to the table and when data drift was observed as part of subsequent ingestions.
For further reading, refer to the following resources:
About the Author
Shoukat Ghouse is a Senior Big Data Specialist Solutions Architect at AWS. He helps customers around the world build robust, efficient and scalable data platforms on AWS leveraging AWS analytics services like AWS Glue, AWS Lake Formation, Amazon Athena and Amazon EMR.