
How VMware Tanzu CloudHealth migrated from self-managed Kafka to Amazon MSK


This is a post co-written with Rivlin Pereira and Vaibhav Pandey from Tanzu CloudHealth (VMware by Broadcom).

VMware Tanzu CloudHealth is the cloud cost management platform of choice for more than 20,000 organizations worldwide, who rely on it to optimize and govern their largest and most complex multi-cloud environments. In this post, we discuss how the VMware Tanzu CloudHealth DevOps team migrated their self-managed Apache Kafka workloads (running version 2.0) to Amazon Managed Streaming for Apache Kafka (Amazon MSK) running version 2.6.2. We discuss the system architectures, deployment pipelines, topic creation, observability, access control, topic migration, and all the issues we faced with the existing infrastructure, along with how and why we migrated to the new Kafka setup and some lessons learned.

Kafka cluster overview

In the fast-evolving landscape of distributed systems, VMware Tanzu CloudHealth’s next-generation microservices platform relies on Kafka as its messaging backbone. For us, Kafka’s high-performance distributed log system excels at handling massive data streams, making it indispensable for seamless communication. Serving as a distributed log system, Kafka efficiently captures and stores diverse logs, from HTTP server access logs to security event audit logs.

Kafka’s versatility shines in supporting key messaging patterns, treating messages as basic logs or structured key-value stores. Dynamic partitioning and consistent ordering ensure efficient message organization. Kafka’s unwavering reliability aligns with our commitment to data integrity.
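To make these patterns concrete, the following is a minimal sketch (not our production code; the broker address, topic, and key are placeholders) of how keyed messages get per-key ordering, because Kafka routes all messages with the same key to the same partition:

    from kafka import KafkaProducer

    # Placeholder bootstrap address; any reachable Kafka broker works.
    producer = KafkaProducer(bootstrap_servers="localhost:9092")

    # Messages sharing a key land on the same partition, so their
    # relative order is preserved for consumers.
    for event in ("created", "updated", "deleted"):
        producer.send("audit-log", key=b"account-42", value=event.encode())

    producer.flush()
    producer.close()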

The integration of Ruby services with Kafka is streamlined by the Karafka library, which acts as a higher-level wrapper. Our other language stacks use similar wrappers. Kafka’s robust debugging features and administrative commands play a pivotal role in ensuring smooth operations and infrastructure health.

Kafka as an architectural pillar

In VMware Tanzu CloudHealth’s next-generation microservices platform, Kafka emerges as a critical architectural pillar. Its ability to handle high data rates, support diverse messaging patterns, and guarantee message delivery aligns seamlessly with our operational needs. As we continue to innovate and scale, Kafka remains a steadfast partner, enabling us to build a resilient and efficient infrastructure.

Why we migrated to Amazon MSK

For us, migrating to Amazon MSK came down to three key decision points:

  • Simplified technical operations – Running Kafka on self-managed infrastructure was an operational overhead for us. We hadn’t updated Kafka version 2.0.0 for a while, and Kafka brokers were going down in production, causing issues with topics going offline. We also had to run scripts manually to increase replication factors and rebalance leaders, which was additional manual effort.
  • Deprecated legacy pipelines and simplified permissions – We were looking to move away from our existing pipelines written in Ansible to create Kafka topics on the cluster. We also had a cumbersome process for giving team members access to Kafka machines in staging and production, and we wanted to simplify this.
  • Cost, patching, and support – Because Apache Zookeeper is fully managed and patched by AWS, moving to Amazon MSK was going to save us time and money. In addition, we found that running the same broker types was cheaper on Amazon MSK than on self-managed Amazon Elastic Compute Cloud (Amazon EC2) instances. Combined with the fact that AWS applies security patches to the brokers for us, migrating to Amazon MSK was an easy decision. This also meant that the team was freed up to work on other important things. Finally, getting enterprise support from AWS was also important in our final decision to move to a managed solution.

How we migrated to Amazon MSK

With the key drivers identified, we moved ahead with a proposed design to migrate the existing self-managed Kafka to Amazon MSK. We performed the following pre-migration steps before the actual implementation:

  • Assessment:
    • Performed a meticulous assessment of the existing EC2 Kafka cluster, understanding its configurations and dependencies
    • Verified Kafka version compatibility with Amazon MSK
  • Amazon MSK setup with Terraform
  • Network configuration:
    • Ensured seamless network connectivity between the EC2 Kafka and MSK clusters, fine-tuning security groups and firewall settings (a sketch of this step follows the list)
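As an illustrative sketch of the security group step (the group IDs, region, and port range below are placeholders, not our actual settings), the equivalent boto3 call authorizes the EC2 Kafka cluster’s security group to reach the MSK brokers:

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    # Allow the old EC2 Kafka cluster's security group to reach the MSK
    # brokers on the standard Kafka ports (9092 plaintext, 9094 TLS).
    ec2.authorize_security_group_ingress(
        GroupId="sg-0123456789abcdef0",  # MSK security group (placeholder)
        IpPermissions=[{
            "IpProtocol": "tcp",
            "FromPort": 9092,
            "ToPort": 9094,
            "UserIdGroupPairs": [
                {"GroupId": "sg-0fedcba987654321f"}  # EC2 Kafka SG (placeholder)
            ],
        }],
    )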

After the pre-migration steps, we implemented the following for the new design:

  • Automated deployment, upgrade, and topic creation pipelines for MSK clusters:
    • In the new setup, we wanted automated deployments and upgrades of the MSK clusters in a repeatable fashion using an IaC tool. Therefore, we created custom Terraform modules for MSK cluster deployments as well as upgrades. These modules were called from a Jenkins pipeline for automated deployments and upgrades of the MSK clusters. For Kafka topic creation, we had been using an Ansible-based home-grown pipeline, which wasn’t stable and led to a lot of complaints from dev teams. As a result, we evaluated options for deployments to Kubernetes clusters and used the Strimzi Topic Operator to create topics on MSK clusters. Topic creation was automated using Jenkins pipelines, which dev teams could self-service (see the topic creation sketch after this list).
  • Better observability for clusters:
    • The old Kafka clusters didn’t have good observability. We only had alerts on Kafka broker disk size. With Amazon MSK, we took advantage of open monitoring using Prometheus. We stood up a standalone Prometheus server that scraped metrics from the MSK clusters and sent them to our internal observability tool. As a result of the improved observability, we were able to set up robust alerting for Amazon MSK, which wasn’t possible with our old setup (see the monitoring sketch after this list).
  • Improved COGS and better compute infrastructure:
    • For our old Kafka infrastructure, we had to pay for managing Kafka and Zookeeper instances, plus any additional broker storage costs and data transfer costs. With the move to Amazon MSK, because Zookeeper is fully managed by AWS, we only have to pay for Kafka nodes, broker storage, and data transfer costs. As a result, in the final Amazon MSK setup for production, we saved not only on infrastructure costs but also on operational costs.
  • Simplified operations and enhanced security:
    • With the move to Amazon MSK, we didn’t have to manage any Zookeeper instances. Broker security patching was also taken care of by AWS for us.
    • Cluster upgrades became simpler with the move to Amazon MSK; it’s a straightforward process to initiate from the Amazon MSK console.
    • With Amazon MSK, we got automatic scaling of broker storage out of the box. As a result, we didn’t have to worry about brokers running out of disk space, which led to additional stability of the MSK cluster.
    • We also got additional security for the cluster because Amazon MSK supports encryption at rest by default, and various options for encryption in transit are also available. For more information, refer to Data protection in Amazon Managed Streaming for Apache Kafka.
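The Strimzi KafkaTopic manifests and the Jenkins pipeline aren’t reproduced here. As a minimal sketch of what the Topic Operator effectively does on our behalf, the equivalent direct admin call with the kafka-python library looks like the following (the bootstrap endpoint, topic name, and settings are placeholders, not our actual values):

    from kafka.admin import KafkaAdminClient, NewTopic

    # Hypothetical MSK bootstrap endpoint; the real value comes from the
    # cluster's GetBootstrapBrokers API output.
    admin = KafkaAdminClient(
        bootstrap_servers="b-1.example.kafka.us-east-1.amazonaws.com:9092"
    )

    # Declare partitions, replication factor, and configs explicitly,
    # mirroring what a Strimzi KafkaTopic manifest would specify.
    admin.create_topics([
        NewTopic(
            name="billing-events",  # placeholder topic name
            num_partitions=12,
            replication_factor=3,
            topic_configs={"retention.ms": "604800000"},  # 7 days
        )
    ])
    admin.close()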
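Similarly, as a hedged sketch of the open monitoring piece (the cluster ARN and region are placeholders), this is how the Prometheus JMX and node exporters can be enabled on an MSK cluster with boto3; a standalone Prometheus server then scrapes the exporter ports on each broker:

    import boto3

    kafka = boto3.client("kafka", region_name="us-east-1")

    # Placeholder ARN; update_monitoring needs the cluster's current
    # version string for optimistic locking, so fetch it first.
    arn = "arn:aws:kafka:us-east-1:123456789012:cluster/example-cluster/abcd1234"
    current = kafka.describe_cluster(ClusterArn=arn)["ClusterInfo"]["CurrentVersion"]

    # Turn on open monitoring so brokers expose Prometheus-scrapable
    # JMX and node metrics.
    kafka.update_monitoring(
        ClusterArn=arn,
        CurrentVersion=current,
        OpenMonitoring={
            "Prometheus": {
                "JmxExporter": {"EnabledInBroker": True},
                "NodeExporter": {"EnabledInBroker": True},
            }
        },
    )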

During our pre-migration steps, we validated the setup in the staging environment before moving ahead with production.

Kafka topic migration strategy

With the MSK cluster setup complete, we performed a data migration of Kafka topics from the old cluster running on Amazon EC2 to the new MSK cluster. To achieve this, we performed the following steps:

  • Set up MirrorMaker with Terraform – We used Terraform to orchestrate the deployment of a MirrorMaker cluster consisting of 15 nodes (a configuration sketch follows this list). This demonstrated scalability and flexibility, because we could adjust the number of nodes based on the migration’s concurrent replication needs.
  • Implement a concurrent replication strategy – We implemented a concurrent replication strategy with 15 MirrorMaker nodes to expedite the migration process. Our Terraform-driven approach contributed to cost optimization by efficiently managing resources during the migration, and it ensured the reliability and consistency of the MSK and MirrorMaker clusters. It also showcased how the chosen setup accelerates data transfer, optimizing both time and resources.
  • Migrate data – We successfully migrated 2 TB of data in a remarkably short timeframe, minimizing downtime and showcasing the efficiency of the concurrent replication strategy.
  • Set up post-migration monitoring – We implemented robust monitoring and alerting during the migration, contributing to a smooth process by identifying and addressing issues promptly.
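The Terraform that provisioned the MirrorMaker nodes isn’t reproduced here. As a hedged sketch of the kind of MirrorMaker 2 configuration each node could run (cluster aliases, endpoints, and the topic pattern are placeholders; the group offset sync property requires the MirrorMaker tooling from Kafka 2.7 or later), written from Python for consistency with the other examples:

    # Minimal MirrorMaker 2 properties; all endpoints and aliases are
    # placeholders, not our actual values. Properties files ignore
    # leading whitespace, so the indentation inside the string is safe.
    mm2_config = """\
    clusters = ec2, msk
    ec2.bootstrap.servers = kafka-ec2.example.internal:9092
    msk.bootstrap.servers = b-1.example.kafka.us-east-1.amazonaws.com:9092

    # Replicate all topics from the old EC2 cluster to Amazon MSK.
    ec2->msk.enabled = true
    ec2->msk.topics = .*

    # Emit checkpoints and sync consumer group offsets so consumers can
    # resume from the right position on the destination cluster.
    emit.checkpoints.enabled = true
    sync.group.offsets.enabled = true
    """

    with open("mm2.properties", "w") as f:
        f.write(mm2_config)

    # Each node then launches: connect-mirror-maker.sh mm2.properties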

The following diagram illustrates the architecture after the topic migration was complete.
MirrorMaker setup

Challenges and lessons learned

Embarking on a migration journey, especially with large datasets, is often accompanied by unforeseen challenges. In this section, we delve into the challenges encountered during the migration of topics from EC2 Kafka to Amazon MSK using MirrorMaker, and share valuable insights and solutions that shaped the success of our migration.

Challenge 1: Offset discrepancies

One of the challenges we encountered was a mismatch in topic offsets between the source and destination clusters, even with offset synchronization enabled in MirrorMaker. The lesson learned here was that offset values don’t necessarily have to be identical, as long as offset sync is enabled, which makes sure the topics have the correct position to read the data from.

We addressed this problem by using a custom tool to run tests on consumer groups, confirming that the translated offsets were either smaller or caught up, indicating synchronization as per MirrorMaker. A sketch of the comparison logic follows.
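Our custom tool isn’t shown in full here. The following is a minimal sketch of the comparison logic with kafka-python, assuming both clusters are reachable, that topic names match on both sides (an identity replication policy), and that the endpoints and group name are placeholders:

    from kafka import KafkaAdminClient

    SOURCE = "kafka-ec2.example.internal:9092"               # placeholder
    DEST = "b-1.example.kafka.us-east-1.amazonaws.com:9092"  # placeholder
    GROUP = "billing-consumer"                               # placeholder

    src_admin = KafkaAdminClient(bootstrap_servers=SOURCE)
    dst_admin = KafkaAdminClient(bootstrap_servers=DEST)

    # Committed offsets per TopicPartition for the consumer group.
    src_offsets = src_admin.list_consumer_group_offsets(GROUP)
    dst_offsets = dst_admin.list_consumer_group_offsets(GROUP)

    # A translated offset smaller than or equal to the source offset
    # means the destination group is in sync or catching up.
    for tp, src_meta in src_offsets.items():
        dst_meta = dst_offsets.get(tp)
        if dst_meta is None:
            print(f"{tp}: not yet translated on the destination")
        elif dst_meta.offset <= src_meta.offset:
            print(f"{tp}: OK (dest {dst_meta.offset} <= source {src_meta.offset})")
        else:
            print(f"{tp}: unexpected, destination ahead of source")

    src_admin.close()
    dst_admin.close()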

Challenge 2: Slow data migration

The migration process faced a bottleneck: data transfer was slower than anticipated, especially with a substantial 2 TB dataset. Despite a 20-node MirrorMaker cluster, the speed was insufficient.

To overcome this, the team strategically grouped MirrorMaker nodes based on unique port numbers. Clusters of five MirrorMaker nodes, each with a distinct port, significantly boosted throughput, allowing us to migrate data within hours instead of days.

Challenge 3: Lack of detailed process documentation

Navigating the uncharted territory of migrating large datasets using MirrorMaker highlighted the absence of detailed documentation for such scenarios.

Through trial and error, the team crafted an IaC module using Terraform. This module streamlined the entire cluster creation process with optimized settings, enabling a seamless start to the migration within minutes.

Final setup and next steps

As a result of the move to Amazon MSK, our final setup after topic migration looked like the following diagram.
We are considering the following future enhancements:

Conclusion

In this post, we discussed how VMware Tanzu CloudHealth migrated their existing Amazon EC2-based Kafka infrastructure to Amazon MSK. We walked you through the new architecture, the deployment and topic creation pipelines, the improvements to observability and access control, the topic migration challenges, and the issues we faced with the existing infrastructure, along with how and why we migrated to the new Amazon MSK setup. We also covered the advantages that Amazon MSK gave us, the final architecture we achieved with this migration, and the lessons learned.

For us, the interplay of offset synchronization, strategic node grouping, and IaC proved pivotal in overcoming obstacles and ensuring a successful migration from Amazon EC2 Kafka to Amazon MSK. This post serves as a testament to the power of adaptability and innovation in migration challenges, offering insights for others navigating a similar path.

If you’re running self-managed Kafka on AWS, we encourage you to try the managed Kafka offering, Amazon MSK.


About the Authors

Rivlin Pereira is a Staff DevOps Engineer in the VMware Tanzu Division. He is very passionate about Kubernetes and works on the CloudHealth Platform, building and operating cloud solutions that are scalable, reliable, and cost-effective.

Vaibhav Pandey, a Staff Software Engineer at Broadcom, is a key contributor to the development of cloud computing solutions. Specializing in architecting and engineering data storage layers, he is passionate about building and scaling SaaS applications for optimal performance.

Raj Ramasubbu is a Senior Analytics Specialist Solutions Architect focused on big data, analytics, and AI/ML at Amazon Web Services. He helps customers architect and build highly scalable, performant, and secure cloud-based solutions on AWS. Raj provided technical expertise and leadership in building data engineering, big data analytics, business intelligence, and data science solutions for over 18 years prior to joining AWS. He has helped customers in various industry verticals like healthcare, medical devices, life sciences, retail, asset management, car insurance, residential REITs, agriculture, title insurance, supply chain, document management, and real estate.

Todd McGrath is a data streaming specialist at Amazon Web Services, where he advises customers on their streaming strategies, integration, architecture, and solutions. On the personal side, he enjoys watching and supporting his three kids in their preferred activities as well as following his own pursuits, such as fishing, pickleball, ice hockey, and happy hour with friends and family on pontoon boats. Connect with him on LinkedIn.

Satya Pattanaik is a Sr. Solutions Architect at AWS. He has been helping ISVs build scalable and resilient applications on the AWS Cloud. Prior to joining AWS, he played a significant role in enterprise segments, contributing to their growth and success. Outside of work, he spends time learning “how to cook a flavorful BBQ” and trying out new recipes.
