Amazon Redshift is a fully managed, scalable cloud data warehouse that accelerates your time to insights with fast, easy, and secure analytics at scale. Tens of thousands of customers rely on Amazon Redshift to analyze exabytes of data and run complex analytical queries, making it the most widely used cloud data warehouse. You can run and scale analytics in seconds on all your data, without having to manage your data warehouse infrastructure.
You can use the Amazon Redshift streaming ingestion capability to update your analytics databases in near-real time. Amazon Redshift streaming ingestion simplifies data pipelines by letting you create materialized views directly on top of data streams. With this capability in Amazon Redshift, you can use Structured Query Language (SQL) to connect to and directly ingest data from data streams, such as Amazon Kinesis Data Streams or Amazon Managed Streaming for Apache Kafka (Amazon MSK) data streams, and pull data directly to Amazon Redshift.
In this post, we discuss the best practices to implement near-real-time analytics using Amazon Redshift streaming ingestion with Amazon MSK.
Overview of solution
We walk through an example pipeline to ingest data from an MSK topic into Amazon Redshift using Amazon Redshift streaming ingestion. We also show how to unnest JSON data using dot notation in Amazon Redshift. The following diagram illustrates our solution architecture.
The process flow consists of the following steps:
- Create a streaming materialized view in your Redshift cluster to consume live streaming data from the MSK topics.
- Use a stored procedure to implement change data capture (CDC) using the unique combination of Kafka Partition and Kafka Offset at the record level for the ingested MSK topic.
- Create a user-facing table in the Redshift cluster and use dot notation to unnest the JSON document from the streaming materialized view into data columns of the table. You can continuously load fresh data by calling the stored procedure at regular intervals.
- Establish connectivity between an Amazon QuickSight dashboard and Amazon Redshift to deliver visualization and insights.
As part of this post, we also discuss the following topics:
- Steps to configure cross-account streaming ingestion from Amazon MSK to Amazon Redshift
- Best practices to achieve optimized performance from streaming materialized views
- Monitoring techniques to track failures in Amazon Redshift streaming ingestion
Prerequisites
You must have the following:
- An AWS account.
- One of the following resources, depending on your use case:
- An MSK cluster. For instructions, refer to Create an Amazon MSK cluster.
- A topic in your MSK cluster where your data producer can publish data.
- A data producer to write data to the topic in your MSK cluster.
Considerations while setting up your MSK topic
Keep in mind the following considerations when configuring your MSK topic:
- Make sure that the name of your MSK topic is no longer than 128 characters.
- As of this writing, MSK records containing compressed data can't be directly queried in Amazon Redshift. Amazon Redshift doesn't support any native decompression methods for client-side compressed data in an MSK topic.
- Follow best practices while setting up your MSK cluster.
- Review the streaming ingestion limitations for any other considerations.
Set up streaming ingestion
To set up streaming ingestion, complete the following steps:
- Set up the AWS Identity and Access Management (IAM) role and trust policy required for streaming ingestion. For instructions, refer to Setting up IAM and performing streaming ingestion from Kafka.
- Make sure that data is flowing into your MSK topic using Amazon CloudWatch metrics (for example, BytesOutPerSec).
- Launch the query editor v2 from the Amazon Redshift console or use your preferred SQL client to connect to your Redshift cluster for the next steps. The following steps were run in query editor v2.
- Create an external schema to map to the MSK cluster. Replace your IAM role ARN and the MSK cluster ARN in the following statement:
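The following is a minimal sketch of such a statement, assuming IAM authentication and a hypothetical external schema name MySchema; the ARNs shown are placeholders:

```sql
CREATE EXTERNAL SCHEMA MySchema
FROM MSK
IAM_ROLE 'arn:aws:iam::0123456789:role/MyRoleARN'  -- replace with your IAM role ARN
AUTHENTICATION iam                                  -- use none if your cluster allows unauthenticated access
CLUSTER_ARN 'arn:aws:kafka:us-east-1:0123456789:cluster/MyTestCluster/uuid';  -- replace with your MSK cluster ARN
```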
- Optionally, if your topic names are case sensitive, you need to enable enable_case_sensitive_identifier to be able to access them in Amazon Redshift. To use case-sensitive identifiers, set enable_case_sensitive_identifier to true at either the session, user, or cluster level.
- Create a materialized view to consume the stream data from the MSK topic:
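The following is a minimal sketch of both steps, assuming the MySchema external schema from the previous statement and a hypothetical topic named orders-topic; the view name Orders_Stream_MV and its Data column match the names used later in this post:

```sql
-- Optional: needed only if your topic names are case sensitive
SET enable_case_sensitive_identifier TO true;

-- Streaming materialized view over the MSK topic; kafka_value arrives as VARBYTE
CREATE MATERIALIZED VIEW Orders_Stream_MV AUTO REFRESH NO AS
SELECT
    kafka_partition,
    kafka_offset,
    refresh_time,
    JSON_PARSE(kafka_value) AS Data      -- convert the payload to SUPER
FROM MySchema."orders-topic"             -- hypothetical topic name
WHERE CAN_JSON_PARSE(kafka_value);       -- skip records that are not valid JSON
```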
The metadata column kafka_value that arrives from Amazon MSK is stored in VARBYTE format in Amazon Redshift. For this post, you use the JSON_PARSE function to convert kafka_value to a SUPER data type. You also use the CAN_JSON_PARSE function in the filter condition to skip invalid JSON records and guard against errors due to JSON parsing failures. We discuss how to store the invalid data for future debugging later in this post.
- Refresh the streaming materialized view, which triggers Amazon Redshift to read from the MSK topic and load data into the materialized view:
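For example, for the view created in the previous step:

```sql
REFRESH MATERIALIZED VIEW Orders_Stream_MV;
```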
You can also set your streaming materialized view to use auto refresh capabilities. This automatically refreshes your materialized view as data arrives in the stream. See CREATE MATERIALIZED VIEW for instructions on creating a materialized view with auto refresh.
Unnest the JSON document
The following is a sample of a JSON document that was ingested from the MSK topic into the Data column of SUPER type in the streaming materialized view Orders_Stream_MV:
Use dot notation as shown in the following code to unnest your JSON payload:
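The following is a sketch of such a query against Orders_Stream_MV; the payload attribute names (orderid, product_name, quantity, order_datetime) are hypothetical, so replace them with the fields in your own JSON document:

```sql
SELECT
    kafka_partition,
    kafka_offset,
    refresh_time,
    -- Attribute names must match the case of the keys in your JSON payload
    Data."orderid"::INT                           AS orderid,
    Data."product_name"::VARCHAR(50)              AS product_name,
    Data."quantity"::INT                          AS quantity,
    Data."order_datetime"::VARCHAR(30)::TIMESTAMP AS order_datetime
FROM Orders_Stream_MV;
```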
The following screenshot shows what the result looks like after unnesting.
If you have arrays in your JSON document, consider unnesting your data using PartiQL statements in Amazon Redshift. For more information, refer to the section Unnest the JSON document in the post Near-real-time analytics using Amazon Redshift streaming ingestion with Amazon Kinesis Data Streams and Amazon DynamoDB.
Incremental data load strategy
Complete the following steps to implement an incremental data load:
- Create a table called Orders in Amazon Redshift, which end-users will use for visualization and business analysis:
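The following is a minimal sketch of such a table, reusing the hypothetical order attributes from the earlier example; adjust the columns, data types, and table attributes for your workload:

```sql
CREATE TABLE public.Orders (
    orderid        INT,
    product_name   VARCHAR(50),
    quantity       INT,
    order_datetime TIMESTAMP
);
```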
Next, you create a stored procedure called SP_Orders_Load to implement CDC from a streaming materialized view and load the data into the final Orders table. You use the combination of Kafka_Partition and Kafka_Offset, available in the streaming materialized view as system columns, to implement CDC. The combination of these two columns is always unique within an MSK topic, which makes sure that none of the records are missed during the process. The stored procedure contains the following components:
- To use case-sensitive identifiers, set enable_case_sensitive_identifier to true at either the session, user, or cluster level.
- Refresh the streaming materialized view manually if auto refresh is not enabled.
- Create an audit table called Orders_Streaming_Audit, if it doesn't exist, to keep track of the last offset for each partition that was loaded into the Orders table during the last run of the stored procedure.
- Unnest and insert only new or changed data into a staging table called Orders_Staging_Table, reading from the streaming materialized view Orders_Stream_MV, where Kafka_Offset is greater than the last processed Kafka_Offset recorded in the audit table Orders_Streaming_Audit for the Kafka_Partition being processed.
- When loading for the first time using this stored procedure, there is no data in the Orders_Streaming_Audit table and all the data from Orders_Stream_MV gets loaded into the Orders table.
- Insert only business-relevant columns into the user-facing Orders table, selecting from the staging table Orders_Staging_Table.
- Insert the max Kafka_Offset for every loaded Kafka_Partition into the audit table Orders_Streaming_Audit.
We've added the intermediate staging table Orders_Staging_Table to this solution to help with debugging and traceability in case of unexpected failures. Skipping the staging step and loading directly into the final table from Orders_Stream_MV can provide lower latency, depending on your use case.
- Create the stored procedure with the following code:
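The following is a condensed sketch of such a stored procedure under the assumptions used so far (manual refresh of Orders_Stream_MV and the hypothetical order attributes); it is not the exact procedure from this post, so adapt the column list, casts, and table attributes to your payload:

```sql
-- Condensed sketch only: orderid, product_name, quantity, and order_datetime are
-- hypothetical payload attributes and must be adapted to your own topic.
CREATE OR REPLACE PROCEDURE SP_Orders_Load()
AS $$
BEGIN
    -- Use case-sensitive identifiers so that JSON attribute names match the payload
    SET enable_case_sensitive_identifier TO true;

    -- Manual refresh; not needed if the materialized view uses auto refresh
    REFRESH MATERIALIZED VIEW Orders_Stream_MV;

    -- Audit table tracking the last offset loaded per partition
    CREATE TABLE IF NOT EXISTS public.Orders_Streaming_Audit (
        kafka_partition BIGINT,
        kafka_offset    BIGINT
    );

    -- Rebuild the staging table with only new records: offset greater than the
    -- last processed offset for that partition (the first run loads everything)
    DROP TABLE IF EXISTS public.Orders_Staging_Table;
    CREATE TABLE public.Orders_Staging_Table AS
    SELECT
        s.kafka_partition,
        s.kafka_offset,
        s.Data."orderid"::INT                           AS orderid,
        s.Data."product_name"::VARCHAR(50)              AS product_name,
        s.Data."quantity"::INT                          AS quantity,
        s.Data."order_datetime"::VARCHAR(30)::TIMESTAMP AS order_datetime
    FROM Orders_Stream_MV s
    LEFT JOIN (
        SELECT kafka_partition, MAX(kafka_offset) AS max_offset
        FROM public.Orders_Streaming_Audit
        GROUP BY kafka_partition
    ) a ON s.kafka_partition = a.kafka_partition
    WHERE a.max_offset IS NULL OR s.kafka_offset > a.max_offset;

    -- Load only the business-relevant columns into the user-facing table
    INSERT INTO public.Orders (orderid, product_name, quantity, order_datetime)
    SELECT orderid, product_name, quantity, order_datetime
    FROM public.Orders_Staging_Table;

    -- Record the highest offset loaded for each partition
    INSERT INTO public.Orders_Streaming_Audit (kafka_partition, kafka_offset)
    SELECT kafka_partition, MAX(kafka_offset)
    FROM public.Orders_Staging_Table
    GROUP BY kafka_partition;
END;
$$ LANGUAGE plpgsql;
```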
- Run the stored procedure to load data into the Orders table.
- Validate the data in the Orders table, as shown in the example after this list.
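For example (the validation query assumes the hypothetical order_datetime column from the earlier sketches):

```sql
-- Load new records from the streaming materialized view into the Orders table
CALL SP_Orders_Load();

-- Validate the loaded data
SELECT COUNT(*) FROM public.Orders;
SELECT * FROM public.Orders ORDER BY order_datetime DESC LIMIT 10;
```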
Set up cross-account streaming ingestion
If your MSK cluster belongs to a different account, complete the following steps to create IAM roles to set up cross-account streaming ingestion. Let's assume the Redshift cluster is in account A and the MSK cluster is in account B, as shown in the following diagram.
Complete the following steps:
- In account B, create an IAM role called MyRedshiftMSKRole that allows Amazon Redshift (account A) to communicate with the MSK cluster (account B) named MyTestCluster. Depending on whether your MSK cluster uses IAM authentication or unauthenticated access to connect, you need to create the IAM role with one of the following policies:
- An IAM policy for Amazon MSK using unauthenticated access:
- An IAM policy for Amazon MSK when using IAM authentication:
The resource section in the preceding example grants access to all topics in the MyTestCluster MSK cluster. If you need to restrict the IAM role to specific topics, you need to replace the topic resource with a more restrictive resource policy.
- After you create the IAM role in account B, take note of the IAM role ARN (for example, arn:aws:iam::0123456789:role/MyRedshiftMSKRole).
- In account A, create a Redshift customizable IAM role called MyRedshiftRole that Amazon Redshift assumes when connecting to Amazon MSK. The role should have a policy like the following, which allows the Amazon Redshift IAM role in account A to assume the Amazon MSK role in account B:
- Take note of the role ARN for the Amazon Redshift IAM role (for example, arn:aws:iam::9876543210:role/MyRedshiftRole).
- Go back to account B and add this role in the trust policy of the IAM role arn:aws:iam::0123456789:role/MyRedshiftMSKRole to allow account B to trust the IAM role from account A. The trust policy should look like the following code:
- Sign in to the Amazon Redshift console as account A.
- Launch the query editor v2 or your preferred SQL client and run the following statements to access the MSK topic in account B. To map to the MSK cluster, create an external schema using role chaining by specifying IAM role ARNs, separated by a comma with no spaces around it. The role attached to the Redshift cluster comes first in the chain.
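The following is a sketch of the role-chained external schema definition, using the example role ARNs from the preceding steps and a placeholder MSK cluster ARN:

```sql
CREATE EXTERNAL SCHEMA MySchema
FROM MSK
-- Role chain: the role attached to the Redshift cluster (account A) comes first,
-- followed by the MSK role in account B, separated by a comma with no spaces
IAM_ROLE 'arn:aws:iam::9876543210:role/MyRedshiftRole,arn:aws:iam::0123456789:role/MyRedshiftMSKRole'
AUTHENTICATION iam
CLUSTER_ARN 'arn:aws:kafka:us-east-1:0123456789:cluster/MyTestCluster/uuid';  -- replace with your MSK cluster ARN
```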
Performance considerations
Keep in mind the following performance considerations:
- Keep the streaming materialized view simple and move transformations like unnesting, aggregation, and case expressions to a later step, for example, by creating another materialized view on top of the streaming materialized view.
- Consider creating only one streaming materialized view in a single Redshift cluster or workgroup for a given MSK topic. Creating multiple materialized views per MSK topic can slow down the ingestion performance because each materialized view becomes a consumer for that topic and shares the Amazon MSK bandwidth for that topic. Live streaming data in a streaming materialized view can be shared across multiple Redshift clusters or Redshift Serverless workgroups using data sharing.
- While defining your streaming materialized view, avoid using JSON_EXTRACT_PATH_TEXT to pre-shred data, because JSON_EXTRACT_PATH_TEXT operates on the data row by row, which significantly impacts ingestion throughput. It's preferable to land the data as is from the stream and then shred it later.
- Where possible, consider skipping the sort key in the streaming materialized view to accelerate ingestion speed. When a streaming materialized view has a sort key, a sort operation occurs with every batch of ingested data from the stream. Sorting has a performance overhead depending on the sort key data type, the number of sort key columns, and the amount of data ingested in each batch. This sorting step can increase the latency before the streaming data is available to query. You should weigh which is more important: latency on ingestion or latency on querying the data.
- For optimized performance of the streaming materialized view and to reduce storage usage, occasionally purge data from the materialized view using delete, truncate, or alter table append.
- If you need to ingest multiple MSK topics in parallel into Amazon Redshift, start with a smaller number of streaming materialized views and keep adding more materialized views to evaluate the overall ingestion performance within a cluster or workgroup.
- Increasing the number of nodes in a Redshift provisioned cluster or the base RPU of a Redshift Serverless workgroup can help boost the ingestion performance of a streaming materialized view. For optimal performance, you should aim to have as many slices in your Redshift provisioned cluster as there are partitions in your MSK topic, or 8 RPU for every four partitions in your MSK topic.
Monitoring techniques
Records in the topic that exceed the size of the target materialized view column at the time of ingestion are skipped. Records that are skipped by the materialized view refresh are logged in the SYS_STREAM_SCAN_ERRORS system table.
Errors that occur when processing a record due to a calculation, a data type conversion, or some other logic in the materialized view definition result in the materialized view refresh failing until the offending record has expired from the topic. To avoid these types of issues, test the logic of your materialized view definition carefully; otherwise, land the records into the default VARBYTE column and process them later.
The following monitoring views are available:
- SYS_MV_REFRESH_HISTORY – Use this view to gather information about the refresh history of your streaming materialized views. The results include the refresh type, such as manual or auto, and the status of the most recent refresh. The following query shows the refresh history for a streaming materialized view:
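The following is a sketch of such a query, assuming the materialized view is named orders_stream_mv; verify the exact column names of SYS_MV_REFRESH_HISTORY against the documentation for your Redshift version:

```sql
SELECT mv_name, refresh_type, status, start_time, end_time
FROM SYS_MV_REFRESH_HISTORY
WHERE mv_name = 'orders_stream_mv'
ORDER BY start_time DESC
LIMIT 10;
```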
- SYS_STREAM_SCAN_ERRORS – Use this view to check the reason why a record failed to load via streaming ingestion from an MSK topic. As of writing this post, when ingesting from Amazon MSK, this view only logs errors when the record is larger than the materialized view column size. This view also shows the unique identifier (offset) of the MSK record in the position column. The following query shows the error code and error reason when a record exceeded the maximum size limit:
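A sketch of such a query; the position, error_code, and error_reason columns are the ones referenced above, and the remaining column names should be verified against your Redshift version:

```sql
SELECT stream_name, record_time, position, error_code, error_reason
FROM SYS_STREAM_SCAN_ERRORS
ORDER BY record_time DESC
LIMIT 10;
```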
- SYS_STREAM_SCAN_STATES – Use this view to monitor the number of records scanned at a given record_time. This view also tracks the offset of the last record read in the batch. The following query shows topic data for a specific materialized view:
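A sketch of such a query; the stream_name filter value is a hypothetical topic name, and the column set of SYS_STREAM_SCAN_STATES should be verified against your Redshift version:

```sql
SELECT *
FROM SYS_STREAM_SCAN_STATES
WHERE stream_name = 'orders-topic'   -- hypothetical topic name
ORDER BY record_time DESC
LIMIT 10;
```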
- SYS_QUERY_HISTORY – Use this view to check the overall metrics for a streaming materialized view refresh. It also logs errors in the error_message column for errors that don't show up in SYS_STREAM_SCAN_ERRORS. The following query shows the error causing the refresh failure of a streaming materialized view:
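A sketch of such a query, filtering on the materialized view name used in this post:

```sql
SELECT query_id, start_time, status, error_message, query_text
FROM SYS_QUERY_HISTORY
WHERE status = 'failed'
  AND query_text LIKE '%Orders_Stream_MV%'
ORDER BY start_time DESC
LIMIT 10;
```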
Additional considerations for implementation
You have the option to create a materialized view on top of a streaming materialized view, allowing you to unnest and precompute results for end-users. This approach eliminates the need to store the results in a final table using a stored procedure.
In this post, you use the CAN_JSON_PARSE function to guard against errors and ingest data more successfully; in this case, the streaming records that can't be parsed are skipped by Amazon Redshift. However, if you want to keep track of your error records, consider storing them in a column using the following SQL when creating the streaming materialized view:
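The following is a sketch of this pattern, extending the earlier Orders_Stream_MV definition (same assumed schema and topic names) with an extra column that keeps the raw payload only for records that fail JSON parsing:

```sql
CREATE MATERIALIZED VIEW Orders_Stream_MV AUTO REFRESH NO AS
SELECT
    kafka_partition,
    kafka_offset,
    refresh_time,
    -- Parsed payload for valid JSON records
    CASE WHEN CAN_JSON_PARSE(kafka_value)
         THEN JSON_PARSE(kafka_value)
    END AS Data,
    -- Raw VARBYTE payload, kept only for records that fail JSON parsing
    CASE WHEN NOT CAN_JSON_PARSE(kafka_value)
         THEN kafka_value
    END AS Invalid_Data
FROM MySchema."orders-topic";
```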
You can also consider unloading data from the view SYS_STREAM_SCAN_ERRORS into an Amazon Simple Storage Service (Amazon S3) bucket and get alerts by sending a report via email using Amazon Simple Notification Service (Amazon SNS) notifications whenever a new S3 object is created.
Finally, based on your data freshness requirements, you can use Amazon EventBridge to schedule the jobs in your data warehouse to call the aforementioned SP_Orders_Load stored procedure on a regular basis. EventBridge does this at fixed intervals, and you may need a mechanism (for example, an AWS Step Functions state machine) to monitor whether the previous call to the procedure completed. For more information, refer to Creating an Amazon EventBridge rule that runs on a schedule. You can also refer to Accelerate orchestration of an ELT process using AWS Step Functions and Amazon Redshift Data API. Another option is to use Amazon Redshift query editor v2 to schedule the refresh. For details, refer to Scheduling a query with query editor v2.
Conclusion
In this post, we discussed best practices to implement near-real-time analytics using Amazon Redshift streaming ingestion with Amazon MSK. We showed you an example pipeline to ingest data from an MSK topic into Amazon Redshift using streaming ingestion. We also showed a reliable strategy to perform incremental streaming data load into Amazon Redshift using Kafka Partition and Kafka Offset. Additionally, we demonstrated the steps to configure cross-account streaming ingestion from Amazon MSK to Amazon Redshift and discussed performance considerations for an optimized ingestion rate. Lastly, we discussed monitoring techniques to track failures in Amazon Redshift streaming ingestion.
If you have any questions, leave them in the comments section.
About the Authors
Poulomi Dasgupta is a Senior Analytics Solutions Architect with AWS. She is passionate about helping customers build cloud-based analytics solutions to solve their business problems. Outside of work, she likes travelling and spending time with her family.
Adekunle Adedotun is a Sr. Database Engineer with the Amazon Redshift service. He has been working on MPP databases for six years with a focus on performance tuning. He also provides guidance to the development team for new and existing service features.