This blog was written in collaboration with Sukh Sekhon, Software Engineer, Cloud Infrastructure, and Helen Li, Sr. Director of Engineering at Exai Bio.
Exai Bio is a next-generation liquid biopsy company with a mission to enable a world where cancer can be detected early, diagnosed accurately, treated in a personalized and targeted way, and ultimately cured. In this blog post, we describe how our engineering team at Exai Bio is leveraging Databricks to bring software engineering best practices to life sciences research and development (R&D).
Highly Differentiated Approach to Liquid Biopsy
Exai’s platform uses RNA sequencing to identify a novel class of cancer-associated, small non-coding RNAs, termed orphan non-coding RNAs (oncRNAs). This technology is based on research first published by Dr. Hani Goodarzi’s lab at the University of California, San Francisco, in Nature Medicine in 2018. oncRNAs are actively secreted from living cells and are stable and abundant in the blood of cancer patients, making them a novel type of cancer biomarker that is accessible through a standard blood draw. In the 18 months since its founding, Exai Bio has performed analyses spanning 12 cancers and over 10,000 subjects, building one of the largest collections of smRNA sequencing data and oncRNA profiles in cancer and normal populations.
Challenges in Life Sciences R&D
What life sciences research is and why it is important
Life sciences R&D refers to the systematic investigation and development efforts with the goal of producing new or significantly improved products, processes, or knowledge to ultimately benefit patients. Exai is focused on developing tests that will assist with early detection and provide actionable insights into cancer.
How this is an engineering problem
The fast-paced and evolving nature of research pushes researchers to adopt or develop new tools and techniques; unfortunately, this often leads to performance testing, vulnerability scanning, and code provenance being an afterthought. Some researchers are familiar with running commands like awk and parallel on sequencing data stored on their local computer; others are familiar with submitting jobs to high-performance computing (HPC) machines operated by an academic institution. Some researchers have scripts that run steps serially, while others leverage workflow orchestration frameworks. Software engineers are challenged not only to create a standardized environment for researchers to use, but also to make it flexible enough to cater to these diverse backgrounds.
How some solutions leave something to be desired
There are infrastructure tools that researchers are increasingly adopting. However, relying on these solutions alone leaves certain gaps. For smaller companies, addressing these gaps often means resorting to makeshift solutions or ‘duct taping’, which requires long-term investment and lacks future adaptability. As a startup, we are able to innovate faster using Databricks’ robust set of features, while having the flexibility to bring our own tools as needed.
Accelerated R&D with Databricks
Exai’s founding team, which includes pioneers in genomics and oncology, knows well the engineering challenges in life sciences research. On day 1, our engineering team set out to find a solution allowing for reproducible research, accelerated science, and a secure data platform. We found a compelling offering in Databricks. As we continue our journey, we are consistently impressed by the ongoing improvements and innovations that Databricks brings to our data-driven workflows.
1. Reproducible research
In our pursuit of reproducibility, Databricks Container Services and Repos have been invaluable features.
Through Databricks Container Services, we are able to control software dependencies and runtime requirements by standardizing the compute environments researchers depend on. As containerization became increasingly valuable for researchers, we established a streamlined CI/CD pipeline to efficiently manage the growing demand for custom Docker images. This pipeline significantly increased our engineering team’s ability to rapidly create, test, and deploy these images.
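To illustrate how a pinned image keeps compute environments reproducible, here is a minimal sketch that builds a cluster specification pointing at a custom Docker image. The `docker_image.url` field follows the Databricks Clusters API; the registry, image name, tag, cluster name, runtime version, and node type are hypothetical placeholders rather than our actual configuration, and `build_cluster_spec` is a helper invented for this example.

```python
def build_cluster_spec(image_url, num_workers=2):
    """Return a cluster spec that runs on a custom container image.

    Every concrete value below is an illustrative assumption.
    """
    # Reject floating tags: an unpinned image can change between runs,
    # which defeats the purpose of a standardized environment.
    name, _, tag = image_url.rpartition(":")
    if not name or "/" in tag or tag in ("", "latest"):
        raise ValueError(f"image must be pinned to an explicit tag: {image_url!r}")
    return {
        "cluster_name": "research-rnaseq",    # hypothetical name
        "spark_version": "13.3.x-scala2.12",  # assumed runtime version
        "node_type_id": "i3.xlarge",          # assumed AWS node type
        "num_workers": num_workers,
        "docker_image": {"url": image_url},
    }


# Hypothetical image reference produced by a CI/CD pipeline.
spec = build_cluster_spec("registry.example.com/research/rnaseq:1.4.2")
```

Requiring an explicit tag is one simple way to make sure that two runs of the same analysis see the same dependencies; a stricter variant could require an image digest instead.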
Through Databricks Repos, we brought Git-based workflows to data analysis, allowing notebooks to be code reviewed. Our researchers embraced this code review process, a standard software engineering practice. We find that it facilitates collaboration, ensures reproducible analyses, and identifies bugs early.
2. Accelerated science
Databricks allows us to move fast on our research timeline. Since Exai’s founding, we have presented over seven research datasets at top oncology conferences. Databricks’ intuitive cluster management lets our researchers harness cloud resources without needing specialized cloud knowledge. Although compute is readily available to all researchers, we are able to keep our costs within budget with features like cluster termination policies and cost breakdowns.
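As a rough sketch of how a termination policy can bound cost, the example below writes a policy definition in the style of Databricks cluster policies (`type: range` with `minValue`, `maxValue`, and `defaultValue`) and applies it to a cluster request. The limits are assumptions for illustration, and `apply_policy` is a hypothetical local check that only mimics the effect of range rules; in practice the platform enforces the policy.

```python
# Policy definition in the style of Databricks cluster policies.
# The limits are illustrative assumptions, not our real settings.
POLICY = {
    "autotermination_minutes": {
        "type": "range", "minValue": 10, "maxValue": 60, "defaultValue": 30,
    },
    "num_workers": {"type": "range", "maxValue": 8},
}


def apply_policy(cluster_spec, policy=POLICY):
    """Hypothetical check: fill in defaults and reject out-of-range values."""
    result = dict(cluster_spec)
    for key, rule in policy.items():
        if key not in result:
            # Missing settings fall back to the policy default, if any.
            if "defaultValue" in rule:
                result[key] = rule["defaultValue"]
            continue
        value = result[key]
        low = rule.get("minValue", float("-inf"))
        high = rule.get("maxValue", float("inf"))
        if not low <= value <= high:
            raise ValueError(f"{key}={value} is outside the allowed range [{low}, {high}]")
    return result


# A request that omits the idle timeout inherits the 30-minute default.
checked = apply_policy({"num_workers": 4})
```

With rules like these, a researcher can still size a cluster freely within the limits, while an idle cluster is guaranteed to shut down within the hour.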
Having a full suite of Lakehouse capabilities (ETL, visualization, governance) in an integrated environment mitigates data silos. Databricks Workflows allows researchers without experience in workflow orchestration tools to specify job dependencies and parallelize their analyses with ease.
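To make the idea of job dependencies concrete, here is a minimal sketch that turns a mapping of steps to their upstream steps into a Databricks Jobs API-style payload (`tasks`, `task_key`, `depends_on`, `notebook_task`). The step names, notebook directory, and both helper functions are hypothetical, and compute settings are omitted for brevity. `parallel_stages` is only a local illustration of which tasks are free to run concurrently; the actual scheduling is done by Databricks Workflows.

```python
def build_job(name, steps, notebook_dir):
    """Build a job payload from {task_key: [upstream task_keys]}."""
    tasks = []
    for task_key, upstream in steps.items():
        unknown = [u for u in upstream if u not in steps]
        if unknown:
            raise ValueError(f"{task_key} depends on unknown tasks: {unknown}")
        tasks.append({
            "task_key": task_key,
            "depends_on": [{"task_key": u} for u in upstream],
            "notebook_task": {"notebook_path": f"{notebook_dir}/{task_key}"},
        })
    return {"name": name, "tasks": tasks}


def parallel_stages(steps):
    """Group tasks into stages; tasks within one stage can run concurrently."""
    remaining = {key: set(deps) for key, deps in steps.items()}
    stages = []
    while remaining:
        ready = sorted(key for key, deps in remaining.items() if not deps)
        if not ready:
            raise ValueError("dependency cycle detected")
        stages.append(ready)
        for key in ready:
            del remaining[key]
        for deps in remaining.values():
            deps.difference_update(ready)
    return stages


# Hypothetical analysis: one QC step fans out to two alignments, then joins.
STEPS = {
    "qc": [],
    "align_a": ["qc"],
    "align_b": ["qc"],
    "summarize": ["align_a", "align_b"],
}

job = build_job("example-analysis", STEPS, "/Repos/research/pipeline")
print(parallel_stages(STEPS))
# [['qc'], ['align_a', 'align_b'], ['summarize']]
```

Declaring only the dependencies, rather than an explicit execution order, is what lets independent steps such as the two alignments run side by side.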
Given Exai’s mission, it is critical to have a fast feedback loop in our method development. The ease of provisioning compute and organizing data in Databricks made this possible.
3. Secure data platform
Exai generates and processes terabytes of sequencing data weekly. We require data security, privacy, and confidentiality. Databricks’ Security and Trust Center makes it easy for us to meet compliance frameworks.
With practical and readily available infrastructure-as-code documentation from Databricks, we are able to manage and scale our infrastructure with ease. Using Databricks’ Terraform provider, we adopted the Databricks Security Reference Architecture; in addition, we implemented a centralized network architecture, following published guidance, to channel the flow of traffic through a single network firewall.
Data is only as secure as the code that runs on it. Understanding where code comes from, who has modified it, and how it has evolved is crucial. We did not want this aspect to be an afterthought at Exai. Databricks allows us to use our own network security controls and make software available through package distribution mechanisms. We are therefore confident that packages have been vetted before they are used by our researchers.
Below is the architecture diagram that illustrates key building blocks of our software engineering infrastructure built with Databricks running on AWS for our R&D.
We’re excited to share what we’ve learned building on Databricks. If you have questions, please reach out to our engineering team at [email protected].