Senior Software Engineer - Hail Team
Broad Institute
United States, Massachusetts, Cambridge
Jan 25, 2025
Description & Requirements
At the Broad Institute broadly and within the Neale Lab specifically, we leverage statistical and software techniques to understand the mechanisms of disease from extremely large datasets generated by scalable sequencing technologies. The lab and Institute are entering an age of one million sequences, millions of transcriptomes, tens of thousands of medical images, and complete medical records. The development of scalable scientific assays has transformed biological engineering problems into software engineering ones. We seek a senior software engineer to help solve those problems.
This team develops, maintains, and operates Hail, a suite of libraries, data systems, and services for analyzing the world's largest genome sequencing datasets. Hail supports scientists at every stage of analysis: starting from individual sequences, producing a sequencing matrix, calculating per-row and per-column statistics, running distributed matrix multiplications to search for genetic relatedness, preparing thousands of phenotypes per sequence, running regressions to search for genetic associations with phenotypes, and subsetting and exporting data for distribution to collaborators. Hail also serves as a data store for web-based data browsers and rare disease diagnostic support systems.
The team faces three major challenges in the coming years. First, the largest sequencing callset has doubled every year since 2003 and the next doubling is anticipated in 2025. Second, the phenotypes have grown from binary disease status tables to medical records, medical images, and cellular assays. Third, the project must adapt to the changing hardware landscape, new scientific-analytical techniques, and new analytical databases.
Hail's two core products are Query and Batch, both of which are open source and openly developed. We are seeking a Senior Software Engineer to focus primarily on Batch. Batch is a cost-metered, multi-tenant, spot-tolerant, elastic, horizontally-scalable compute engine. The team operates an installation of Batch as a Software-as-a-Service for a community of hundreds of scientists within the Broad Institute.
Batch is implemented in Python. The control plane is deployed on Kubernetes, and the compute plane is a directly managed set of VMs. Batch relies on many technologies, including OCI container images, crun, Google and Azure cloud storage, Google and Azure VM APIs, Google and Azure container registry APIs, Grafana, Prometheus, OAuth2, MySQL, Envoy, and asyncio.
Responsibilities
Requirements
In addition to Python, our current technology stack also includes the JVM, Scala, GCP, Azure, and C++. Our domain knowledge includes machine learning, bioinformatics, statistical genetics, compilers, and theoretical math. Hires need not have experience with every aspect of our technologies and domains. Our website: https://hail.is. Our GitHub: https://github.com/hail-is/hail.