
How the GoDaddy data platform achieved over 60% cost savings and a 50% performance boost by adopting Amazon EMR Serverless


This is a guest post co-written with Brandon Abear, Dinesh Sharma, John Bush, and Ozcan Ilikhan from GoDaddy.

GoDaddy empowers everyday entrepreneurs by providing all the help and tools to succeed online. With more than 20 million customers worldwide, GoDaddy is the place people come to name their ideas, build a professional website, attract customers, and manage their work.

At GoDaddy, we take pride in being a data-driven company. Our relentless pursuit of valuable insights from data fuels our business decisions and ensures customer satisfaction. Our commitment to efficiency is unwavering, and we've undertaken an exciting initiative to optimize our batch processing jobs. In this journey, we have identified a structured approach that we refer to as the seven layers of improvement opportunities. This methodology has become our guide in the pursuit of efficiency.

In this post, we discuss how we enhanced operational efficiency with Amazon EMR Serverless. We share our benchmarking results and methodology, and insights into the cost-effectiveness of EMR Serverless vs. fixed-capacity Amazon EMR on EC2 transient clusters on our data workflows orchestrated using Amazon Managed Workflows for Apache Airflow (Amazon MWAA). We share our strategy for the adoption of EMR Serverless in areas where it excels. Our findings reveal significant benefits, including over 60% cost reduction, 50% faster Spark workloads, a remarkable five-times improvement in development and testing speed, and a significant reduction in our carbon footprint.

Background

In late 2020, GoDaddy's data platform initiated its AWS Cloud journey, migrating an 800-node Hadoop cluster with 2.5 PB of data from its data center to EMR on EC2. This lift-and-shift approach facilitated a direct comparison between on-premises and cloud environments, ensuring a smooth transition to AWS pipelines while minimizing data validation issues and migration delays.

By early 2022, we had successfully migrated our big data workloads to EMR on EC2. Using best practices learned from the AWS FinHack program, we fine-tuned resource-intensive jobs, converted Pig and Hive jobs to Spark, and reduced our batch workload spend by 22.75% in 2022. However, scalability challenges emerged due to the multitude of jobs. This prompted GoDaddy to embark on a systematic optimization journey, establishing a foundation for more sustainable and efficient big data processing.

Seven layers of improvement opportunities

In our quest for operational efficiency, we have identified seven distinct layers of opportunities for optimization within our batch processing jobs, as shown in the following figure. These layers range from precise code-level enhancements to more comprehensive platform improvements. This multi-layered approach has become our strategic blueprint in the ongoing pursuit of better performance and higher efficiency.

Seven layers of improvement opportunities

The layers are as follows:

  • Code optimization – Focuses on refining the code logic and how it can be optimized for better performance. This includes performance enhancements through selective caching, partition and projection pruning, join optimizations, and other job-specific tuning. Using AI coding solutions is also an integral part of this process.
  • Software updates – Updating to the latest versions of open source software (OSS) to capitalize on new features and improvements. For example, Adaptive Query Execution in Spark 3 brings significant performance and cost improvements.
  • Custom Spark configurations – Tuning of custom Spark configurations to maximize resource utilization, memory, and parallelism. We can achieve significant improvements by right-sizing tasks, such as through spark.sql.shuffle.partitions, spark.sql.files.maxPartitionBytes, spark.executor.cores, and spark.executor.memory (see the sketch after this list). However, these custom configurations might be counterproductive if they are not compatible with the specific Spark version.
  • Resource provisioning time – The time it takes to launch resources like ephemeral EMR clusters on Amazon Elastic Compute Cloud (Amazon EC2). Although some factors influencing this time are outside of an engineer's control, identifying and addressing the factors that can be optimized can help reduce overall provisioning time.
  • Fine-grained scaling at the task level – Dynamically adjusting resources such as CPU, memory, disk, and network bandwidth based on each stage's needs within a task. The aim here is to avoid fixed cluster sizes that could result in resource waste.
  • Fine-grained scaling across multiple tasks in a workflow – Given that each task has unique resource requirements, maintaining a fixed resource size may result in under- or over-provisioning for certain tasks within the same workflow. Traditionally, the size of the largest task determines the cluster size for a multi-task workflow. However, dynamically adjusting resources across multiple tasks and steps within a workflow results in a more cost-effective implementation.
  • Platform-level enhancements – Enhancements at the previous layers can only optimize a given job or workflow. Platform improvement aims to attain efficiency at the company level. We can achieve this through various means, such as updating or upgrading the core infrastructure, introducing new frameworks, allocating appropriate resources for each job profile, balancing service usage, optimizing the use of Savings Plans and Spot Instances, or implementing other comprehensive changes to boost efficiency across all tasks and workflows.
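
To make the right-sizing layer concrete, the following minimal PySpark sketch sets the configurations named above. Every value shown is an illustrative assumption; real values must be derived from each job's data volume and observed metrics.

  from pyspark.sql import SparkSession

  # All values below are illustrative; derive real ones from job metrics.
  spark = (
      SparkSession.builder
      .appName("right-sized-batch-job")
      # Match shuffle parallelism to the shuffle data volume.
      .config("spark.sql.shuffle.partitions", "400")
      # Cap input split size so scan tasks stay uniform.
      .config("spark.sql.files.maxPartitionBytes", "268435456")  # 256 MB
      # Size executors so tasks are neither starved nor wasteful.
      .config("spark.executor.cores", "4")
      .config("spark.executor.memory", "16g")
      .getOrCreate()
  )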

Layers 1–3: Previous cost reductions

When we migrated from on premises to the AWS Cloud, we primarily focused our cost-optimization efforts on the first three layers shown in the diagram. By transitioning our most costly legacy Pig and Hive pipelines to Spark and optimizing Spark configurations for Amazon EMR, we achieved significant cost savings.

For example, a legacy Pig job took 10 hours to complete and ranked among the top 10 most expensive EMR jobs. Upon reviewing TEZ logs and cluster metrics, we discovered that the cluster was vastly over-provisioned for the data volume being processed and remained under-utilized for most of the runtime. Transitioning from Pig to Spark was more efficient. Although no automated tools were available for the conversion, manual optimizations were made, including the following (two of them are illustrated in the sketch after the list):

  • Reduced unnecessary disk writes, saving serialization and deserialization time (Layer 1)
  • Replaced Airflow task parallelization with Spark, simplifying the Airflow DAG (Layer 1)
  • Eliminated redundant Spark transformations (Layer 1)
  • Upgraded from Spark 2 to 3, using Adaptive Query Execution (Layer 2)
  • Addressed skewed joins and optimized joins with smaller dimension tables (Layer 3)
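
The following hedged PySpark fragment illustrates the Layer 2 and Layer 3 items above: enabling Spark 3 Adaptive Query Execution with skew-join handling, and broadcasting a small dimension table. Table names and the output path are placeholders, not the actual pipeline.

  from pyspark.sql import SparkSession
  from pyspark.sql.functions import broadcast

  spark = SparkSession.builder.appName("pig-to-spark-rewrite").getOrCreate()

  # Layer 2: Adaptive Query Execution, including automatic skew-join splitting.
  spark.conf.set("spark.sql.adaptive.enabled", "true")
  spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

  # Placeholder tables standing in for the real pipeline inputs.
  facts = spark.table("events")        # large, skewed fact table
  dims = spark.table("event_types")    # small dimension table

  # Layer 3: broadcast the small dimension side to avoid shuffling the fact table.
  joined = facts.join(broadcast(dims), "event_type_id")
  joined.write.mode("overwrite").parquet("s3://example-bucket/output/")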

As a result, the job cost decreased by 95%, and the job completion time was reduced to 1 hour. However, this approach was labor-intensive and not scalable for numerous jobs.

Layers 4–6: Find and adopt the right compute solution

In late 2022, following our significant accomplishments in optimization at the previous levels, our attention moved toward enhancing the remaining layers.

Understanding the state of our batch processing

We use Amazon MWAA to orchestrate our data workflows in the cloud at scale. Apache Airflow is an open source tool used to programmatically author, schedule, and monitor sequences of processes and tasks referred to as workflows. In this post, the terms workflow and job are used interchangeably, referring to the Directed Acyclic Graphs (DAGs) consisting of tasks orchestrated by Amazon MWAA. For each workflow, we have sequential or parallel tasks, or even a combination of both, in the DAG between create_emr and terminate_emr tasks, running on a transient EMR cluster with fixed compute capacity throughout the workflow run (a minimal sketch of this pattern follows). Even after optimizing a portion of our workload, we still had numerous non-optimized workflows that were under-utilized due to over-provisioning of compute resources based on the most resource-intensive task in the workflow, as shown in the following figure.
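
A minimal Airflow DAG sketch of that pattern is shown below, assuming Airflow 2.x with the Amazon provider package; the cluster definition and schedule are hypothetical placeholders, and real DAGs carry the actual EMR step tasks between the two operators.

  from datetime import datetime

  from airflow import DAG
  from airflow.providers.amazon.aws.operators.emr import (
      EmrCreateJobFlowOperator,
      EmrTerminateJobFlowOperator,
  )

  # Hypothetical fixed-capacity cluster definition for the whole workflow run.
  JOB_FLOW_OVERRIDES = {
      "Name": "transient-batch-cluster",
      "ReleaseLabel": "emr-6.9.0",
      "Instances": {
          "InstanceGroups": [],  # fixed compute capacity would be declared here
          "KeepJobFlowAliveWhenNoSteps": True,
      },
  }

  with DAG(
      dag_id="sample_transient_emr_workflow",
      start_date=datetime(2023, 1, 1),
      schedule="@daily",
      catchup=False,
  ) as dag:
      create_emr = EmrCreateJobFlowOperator(
          task_id="create_emr",
          job_flow_overrides=JOB_FLOW_OVERRIDES,
      )
      # Sequential or parallel EMR step tasks would run between these two.
      terminate_emr = EmrTerminateJobFlowOperator(
          task_id="terminate_emr",
          job_flow_id=create_emr.output,
          trigger_rule="all_done",  # always tear down the transient cluster
      )
      create_emr >> terminate_emr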

This highlighted the impracticality of static resource allocation and led us to recognize the necessity of a dynamic resource allocation (DRA) system. Before proposing a solution, we gathered extensive data to thoroughly understand our batch processing. Analyzing the cluster step time, excluding provisioning and idle time, revealed significant insights: a right-skewed distribution with over half of the workflows completing in 20 minutes or less and only 10% taking more than 60 minutes. This distribution guided our choice of a fast-provisioning compute solution, dramatically reducing workflow runtimes. The following diagram illustrates step times (excluding provisioning and idle time) of EMR on EC2 transient clusters in one of our batch processing accounts.

Additionally, based on the step time (excluding provisioning and idle time) distribution of the workflows, we categorized our workflows into three groups (a bucketing sketch follows the list):

  • Quick run – Lasting 20 minutes or less
  • Medium run – Lasting between 20–60 minutes
  • Long run – Exceeding 60 minutes, often spanning several hours or more
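
The bucketing itself is straightforward. Here is a sketch, assuming step times (in minutes, excluding provisioning and idle time) have been exported to a file with a step_time_min column; both the file name and column name are assumptions for illustration.

  import pandas as pd

  # Hypothetical export of per-workflow step times from cluster metrics.
  df = pd.read_csv("emr_step_times.csv")

  df["category"] = pd.cut(
      df["step_time_min"],
      bins=[0, 20, 60, float("inf")],
      labels=["quick run", "medium run", "long run"],
  )
  # Share of workflows in each bracket, mirroring the distribution above.
  print(df["category"].value_counts(normalize=True))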

Another factor we needed to consider was the extensive use of transient clusters for reasons such as security, job and cost isolation, and purpose-built clusters. Additionally, there was significant variation in resource needs between peak hours and periods of low utilization.

Instead of fixed-size clusters, we could potentially use managed scaling on EMR on EC2 to achieve some cost benefits. However, migrating to EMR Serverless appeared to be a more strategic direction for our data platform. In addition to potential cost benefits, EMR Serverless offers additional advantages such as a one-click upgrade to the newest Amazon EMR versions, a simplified operational and debugging experience, and automatic upgrades to the latest generations upon rollout. These features collectively simplify the process of operating a platform at a larger scale.

Evaluating EMR Serverless: A case study at GoDaddy

EMR Serverless is a serverless option in Amazon EMR that eliminates the complexities of configuring, managing, and scaling clusters when running big data frameworks like Apache Spark and Apache Hive. With EMR Serverless, businesses can enjoy numerous benefits, including cost-effectiveness, faster provisioning, a simplified developer experience, and improved resilience to Availability Zone failures.
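
For readers unfamiliar with the service, submitting a Spark job is a single API call once an application exists. The following boto3 sketch uses placeholder identifiers (application ID, role ARN, script path, region) that would differ in any real account:

  import boto3

  client = boto3.client("emr-serverless", region_name="us-west-2")

  # All identifiers below are placeholders for illustration only.
  response = client.start_job_run(
      applicationId="00example0app0id",
      executionRoleArn="arn:aws:iam::123456789012:role/EMRServerlessJobRole",
      jobDriver={
          "sparkSubmit": {
              "entryPoint": "s3://example-bucket/scripts/etl_job.py",
              "sparkSubmitParameters": "--conf spark.executor.cores=4",
          }
      },
  )
  print(response["jobRunId"])  # poll get_job_run with this ID for status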

Recognizing the potential of EMR Serverless, we conducted an in-depth benchmark study using real production workflows. The study aimed to assess EMR Serverless performance and efficiency while also creating an adoption plan for large-scale implementation. The findings were highly encouraging, showing EMR Serverless can effectively handle our workloads.

Benchmarking methodology

We split our data workflows into three categories based on total step time (excluding provisioning and idle time): quick run (0–20 minutes), medium run (20–60 minutes), and long run (over 60 minutes). We analyzed the impact of the EMR deployment type (Amazon EC2 vs. EMR Serverless) on two key metrics: cost-efficiency and total runtime speedup, which served as our overall evaluation criteria. Although we didn't formally measure ease of use and resiliency, these factors were considered throughout the evaluation process.

The high-level steps to assess the environment are as follows:

  1. Prepare the data and environment:
    1. Choose three to five random production jobs from each job category.
    2. Implement required adjustments to prevent interference with production.
  2. Run tests:
    1. Run scripts over several days or through multiple iterations to gather precise and consistent data points.
    2. Perform tests using EMR on EC2 and EMR Serverless.
  3. Validate data and test runs:
    1. Validate input and output datasets, partitions, and row counts to ensure identical data processing.
  4. Gather metrics and analyze results:
    1. Gather relevant metrics from the tests.
    2. Analyze results to draw insights and conclusions (the evaluation criteria reduce to simple ratios, as shown in the sketch after this list).
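
Both evaluation criteria are relative-improvement ratios. The following sketch reproduces the headline numbers from the sample comparison table shown later in this post:

  def improvement(ec2_value: float, serverless_value: float) -> float:
      """Percent improvement of EMR Serverless relative to EMR on EC2."""
      return (ec2_value - serverless_value) / ec2_value * 100

  # Values taken from the sample comparison table in the next section.
  print(f"Cost savings:    {improvement(5.82, 2.60):.0f}%")    # ~55%
  print(f"Runtime speedup: {improvement(53.40, 39.40):.0f}%")  # ~26%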

Benchmark results

Our benchmark results showed significant improvements across all three job categories for both runtime speedup and cost-efficiency. The improvements were most pronounced for quick jobs, directly resulting from faster startup times. For instance, a 20-minute (including cluster provisioning and shutdown) data workflow running on an EMR on EC2 transient cluster of fixed compute capacity finishes in 10 minutes on EMR Serverless, providing a shorter runtime with cost benefits. Overall, the shift to EMR Serverless delivered substantial performance improvements and cost reductions at scale across job brackets, as seen in the following figure.

Historically, we devoted more time to tuning our long-run workflows. Interestingly, we discovered that the existing custom Spark configurations for these jobs did not always translate well to EMR Serverless. In cases where the results were insignificant, a common approach was to discard previous Spark configurations related to executor cores. By allowing EMR Serverless to autonomously manage these Spark configurations, we often observed improved outcomes. The following graph shows the average runtime and cost improvement per job when comparing EMR Serverless to EMR on EC2.

Per Job Improvement

The following table shows a sample comparison of results for the same workflow running on different deployment options of Amazon EMR (EMR on EC2 and EMR Serverless).

Metric                        EMR on EC2 (Average)          EMR Serverless (Average)   EMR on EC2 vs. EMR Serverless
Total Run Cost ($)            $5.82                         $2.60                      55%
Total Run Time (Minutes)      53.40                         39.40                      26%
Provisioning Time (Minutes)   10.20                         0.05                       –
Provisioning Cost ($)         $1.19                         –                          –
Steps Time (Minutes)          38.20                         39.16                      -3%
Steps Cost ($)                $4.30                         –                          –
Idle Time (Minutes)           4.80                          –                          –
EMR Release Label             emr-6.9.0                     –                          –
Hadoop Distribution           Amazon 3.3.3                  –                          –
Spark Version                 Spark 3.3.0                   –                          –
Hive/HCatalog Version         Hive 3.1.3, HCatalog 3.1.3    –                          –
Job Type                      Spark                         –                          –

AWS Graviton2 on EMR Serverless performance evaluation

After seeing compelling results with EMR Serverless for our workloads, we decided to further analyze the performance of the AWS Graviton2 (arm64) architecture within EMR Serverless. AWS had benchmarked Spark workloads on Graviton2 EMR Serverless using the TPC-DS 3 TB scale, showing a 27% overall price-performance improvement.

To better understand the integration benefits, we ran our own study using GoDaddy's production workloads on a daily schedule and observed an impressive 23.8% price-performance enhancement across a range of jobs when using Graviton2. For more details about this study, see GoDaddy benchmarking results in up to 24% better price-performance for their Spark workloads with AWS Graviton2 on Amazon EMR Serverless.

Adoption strategy for EMR Serverless

We strategically implemented a phased rollout of EMR Serverless via deployment rings, enabling systematic integration. This gradual approach let us validate improvements and halt further adoption of EMR Serverless, if needed. It served both as a safety net to catch issues early and a means to refine our infrastructure. The process mitigated change impact through smooth operations while building expertise across our Data Engineering and DevOps teams. Additionally, it fostered tight feedback loops, allowing prompt adjustments and ensuring efficient EMR Serverless integration.

We divided our workflows into three main adoption groups, as shown in the following image:

  • Canaries – This group aids in detecting and resolving any potential problems early in the deployment stage.
  • Early adopters – This is the second batch of workflows that adopt the new compute solution after initial issues have been identified and rectified by the canaries group.
  • Broad deployment rings – The largest group of rings, this group represents the wide-scale deployment of the solution. These are deployed after successful testing and implementation in the previous two groups.

Rings

We further broke down these workflows into granular deployment rings to adopt EMR Serverless, as shown in the following table.

Ring #   Name             Details
Ring 0   Canary           Low adoption risk jobs that are expected to yield some cost saving benefits
Ring 1   Early Adopters   Low risk quick-run Spark jobs that are expected to yield high gains
Ring 2   Quick-run        Rest of the quick-run (step_time <= 20 min) Spark jobs
Ring 3   LargerJobs_EZ    High potential gain, easy-move, medium-run and long-run Spark jobs
Ring 4   LargerJobs       Rest of the medium-run and long-run Spark jobs with potential gains
Ring 5   Hive             Hive jobs with potentially higher cost savings
Ring 6   Redshift_EZ      Easy-migration Redshift jobs that suit EMR Serverless
Ring 7   Glue_EZ          Easy-migration Glue jobs that suit EMR Serverless

Production adoption results summary

The encouraging benchmarking and canary adoption results generated considerable interest in wider EMR Serverless adoption at GoDaddy. To date, the EMR Serverless rollout remains underway; so far, it has reduced costs by 62.5% and accelerated total batch workflow completion by 50.4%.

Based on initial benchmarks, our team anticipated substantial gains for quick jobs. To our surprise, actual production deployments surpassed projections, averaging 64.4% faster vs. 42% projected, and 71.8% cheaper vs. 40% predicted.

Remarkably, long-running jobs also saw significant performance improvements due to the fast provisioning of EMR Serverless and the aggressive scaling enabled by dynamic resource allocation. We observed substantial parallelization during high-resource segments, resulting in a 40.5% faster total runtime compared to traditional approaches. The following chart illustrates the average improvements per job category.

Prod Jobs Savings

Additionally, we observed the highest degree of dispersion in speed improvements within the long-run job category, as shown in the following box-and-whisker plot.

Whisker Plot

Sample workflows that adopted EMR Serverless

For a large workflow migrated to EMR Serverless, comparing 3-week averages pre- and post-migration revealed impressive cost savings: a 75.30% decrease based on retail pricing, with a 10% improvement in total runtime, boosting operational efficiency. The following graph illustrates the cost trend.

Although quick-run jobs realized minimal per-dollar cost reductions, they delivered the most significant percentage cost savings. With thousands of these workflows running daily, the accumulated savings are substantial. The following graph shows the cost trend for a small workload migrated from EMR on EC2 to EMR Serverless. Comparing 3-week pre- and post-migration averages revealed a remarkable 92.43% cost savings on retail on-demand pricing, along with an 80.6% acceleration in total runtime.

Sample workflows adopted EMR Serverless 2

Layer 7: Platform-wide improvements

We aim to revolutionize compute operations at GoDaddy, providing simplified yet powerful solutions for all users with our Intelligent Compute Platform. With AWS compute solutions like EMR Serverless and EMR on EC2, it provides optimized runs of data processing and machine learning (ML) workloads. An ML-powered job broker intelligently determines when and how to run jobs based on various parameters, while still allowing power users to customize. Additionally, an ML-powered compute resource manager pre-provisions resources based on load and historical data, providing efficient, fast provisioning at optimal cost. Intelligent compute empowers users with out-of-the-box optimization, catering to diverse personas without compromising power users.

The following diagram shows a high-level illustration of the intelligent compute architecture.

Insights and recommended best practices

The following section discusses the insights we've gathered and the recommended best practices we've developed during our initial and wider adoption phases.

Infrastructure preparation

Although EMR Serverless is a deployment method within EMR, it requires some infrastructure preparedness to realize its potential. Consider the following requirements and practical guidance on implementation:

  • Use large subnets across multiple Availability Zones – When running EMR Serverless workloads within your VPC, make sure the subnets span multiple Availability Zones and are not constrained by IP addresses. Refer to Configuring VPC access and Best practices for subnet planning for details.
  • Modify maximum concurrent vCPU quota – For extensive compute requirements, it is recommended to increase your max concurrent vCPUs per account service quota.
  • Amazon MWAA version compatibility – When adopting EMR Serverless, GoDaddy's decentralized Amazon MWAA ecosystem for data pipeline orchestration created compatibility issues from disparate AWS Providers versions. Directly upgrading Amazon MWAA was more efficient than updating numerous DAGs. We facilitated adoption by upgrading Amazon MWAA instances ourselves, documenting issues, and sharing findings and effort estimates for accurate upgrade planning.
  • GoDaddy EMR operator – To streamline migrating numerous Airflow DAGs from EMR on EC2 to EMR Serverless, we developed custom operators adapting existing interfaces. This allowed seamless transitions while retaining familiar tuning options. Data engineers could easily migrate pipelines with simple find-replace imports and immediately use EMR Serverless (a sketch of this adapter idea follows the list).
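
The adapter idea can be sketched as a thin wrapper. Everything below (the class name, the spark_conf convention) is hypothetical and not GoDaddy's actual operator; it only shows how an EMR on EC2-style tuning interface can be preserved on top of the provider's EMR Serverless operator, so DAG migration stays a find-replace of imports.

  from typing import Optional

  from airflow.providers.amazon.aws.operators.emr import (
      EmrServerlessStartJobOperator,
  )

  class GoDaddyEmrServerlessOperator(EmrServerlessStartJobOperator):
      """Hypothetical adapter keeping familiar EMR-on-EC2-style tuning knobs."""

      def __init__(self, *, spark_conf: Optional[dict] = None, **kwargs):
          # Translate a familiar {conf: value} dict into sparkSubmitParameters.
          params = " ".join(
              f"--conf {key}={value}" for key, value in (spark_conf or {}).items()
          )
          job_driver = kwargs.pop("job_driver", {"sparkSubmit": {}})
          job_driver.setdefault("sparkSubmit", {})["sparkSubmitParameters"] = params
          super().__init__(job_driver=job_driver, **kwargs)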

Unexpected behavior mitigation

The following are unexpected behaviors we ran into and what we did to mitigate them:

  • Spark DRA aggressive scaling – For some jobs (8.33% of initial benchmarks, 13.6% of production), cost increased after migrating to EMR Serverless. This was due to Spark DRA briefly assigning excessive numbers of new workers, prioritizing performance over cost. To counteract this, we set maximum executor thresholds by adjusting spark.dynamicAllocation.maxExecutors, effectively limiting EMR Serverless scaling aggression. When migrating from EMR on EC2, we suggest observing the max core count in the Spark History UI to replicate similar compute limits in EMR Serverless, such as --conf spark.executor.cores and --conf spark.dynamicAllocation.maxExecutors.
  • Managing disk space for large-scale jobs – When transitioning jobs that process large data volumes with substantial shuffles and significant disk requirements to EMR Serverless, we recommend configuring spark.emr-serverless.executor.disk by referring to existing Spark job metrics. Additionally, configurations like spark.executor.cores combined with spark.emr-serverless.executor.disk and spark.dynamicAllocation.maxExecutors allow control over the underlying worker size and total attached storage when advantageous. For example, a shuffle-heavy job with relatively low disk usage may benefit from using a larger worker to increase the likelihood of local shuffle fetches. Both mitigations are combined in the sketch after this list.
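
A combined sketch of both mitigations, expressed as the sparkSubmitParameters for an EMR Serverless job run; every value is illustrative and should instead be taken from the Spark History UI and disk metrics of the original EMR on EC2 run.

  # Illustrative job-run parameters capping DRA scaling and sizing executor disk.
  spark_submit_parameters = " ".join([
      "--conf spark.executor.cores=4",
      "--conf spark.executor.memory=16g",
      # Cap dynamic allocation so Serverless cannot out-scale the old cluster.
      "--conf spark.dynamicAllocation.maxExecutors=50",
      # Size attached storage from observed shuffle and disk metrics.
      "--conf spark.emr-serverless.executor.disk=100g",
  ])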

Conclusion

As discussed in this post, our experiences with adopting EMR Serverless on arm64 have been overwhelmingly positive. The impressive results we've achieved, including a 60% reduction in cost, 50% faster runs of batch Spark workloads, and an astounding five-times improvement in development and testing speed, speak volumes about the potential of this technology. Furthermore, our current results suggest that by broadly adopting Graviton2 on EMR Serverless, we could potentially reduce the carbon footprint of our batch processing by up to 60%.

However, it's crucial to understand that these results are not a one-size-fits-all scenario. The improvements you can expect are subject to factors including, but not limited to, the specific nature of your workflows, cluster configurations, resource utilization levels, and fluctuations in computational capacity. Therefore, we strongly advocate for a data-driven, ring-based deployment strategy when considering the integration of EMR Serverless, which can help you optimize its benefits to the fullest.

Special thanks to Mukul Sharma and Boris Berlin for their contributions to benchmarking. Many thanks to Travis Muhlestein (CDO), Abhijit Kundu (VP Eng), Vincent Yung (Sr. Director Eng.), and Wai Kin Lau (Sr. Director Data Eng.) for their continued support.


About the Authors

Brandon Abear is a Principal Data Engineer in the Data & Analytics (DnA) organization at GoDaddy. He enjoys all things big data. In his spare time, he enjoys traveling, watching movies, and playing rhythm games.

Dinesh Sharma is a Principal Data Engineer in the Data & Analytics (DnA) organization at GoDaddy. He is passionate about user experience and developer productivity, always looking for ways to optimize engineering processes and save cost. In his spare time, he loves reading and is an avid manga fan.

John Bush is a Principal Software Engineer in the Data & Analytics (DnA) organization at GoDaddy. He is passionate about making it easier for organizations to manage data and use it to drive their businesses forward. In his spare time, he loves hiking, camping, and riding his ebike.

Ozcan Ilikhan is the Director of Engineering for the Data and ML Platform at GoDaddy. He has over 20 years of multidisciplinary leadership experience, spanning startups to global enterprises. He has a passion for leveraging data and AI to create solutions that delight customers, empower them to achieve more, and boost operational efficiency. Outside of his professional life, he enjoys reading, hiking, gardening, volunteering, and embarking on DIY projects.

Harsh Vardhan is an AWS Solutions Architect, specializing in big data and analytics. He has over 8 years of experience working in the field of big data and data science. He is passionate about helping customers adopt best practices and discover insights from their data.
