Millions of protein complexes added to AlphaFold Database shed light on how proteins interact
A new collaboration between EMBL's European Bioinformatics Institute (EMBL-EBI), Google DeepMind, NVIDIA, and Seoul National University has made millions of AI-predicted protein complex structures openly available through the AlphaFold Database. To maximize global health impact, the dataset prioritizes proteins important for understanding human health and disease. This is the largest dataset of protein complex predictions currently available.
Proteins are the building blocks of life. They interact to create protein complexes, which fulfill biological functions. By visualizing protein interactions, scientists can uncover the molecular mechanisms that drive cell behavior, identify what goes wrong when someone gets sick, and develop new drugs and therapies. Predicting the structure of protein complexes is extremely challenging because, in nature, proteins change shape and interact in many different ways.
"Science thrives on collaboration," said Jo McEntyre, Interim Director of EMBL-EBI. "By making this foundational protein complex dataset openly available to the world, we're inviting researchers to test, refine, and build on it to drive the next wave of biological discoveries."
Protein complexes for global health impact
The latest AlphaFold Database update spans millions of homodimers, which are protein complexes formed of two identical proteins. It focuses on 20 of the most studied species, including humans, as well as the pathogens on the World Health Organization's bacterial priority pathogens list. This approach aims to bring significant and immediate value for global health challenges.
"By expanding the AlphaFold Database to include protein complexes, we are addressing a critical need expressed by the scientific community," said Anna Koivuniemi, Head of the Google DeepMind Impact Accelerator.
"We hope that by lowering the barrier to these complex predictions, we can empower researchers everywhere to pursue the next wave of discoveries that could ultimately improve human health on a global scale."
Scientific expertise meets technical innovation
The collaboration builds on Google DeepMind's AI system AlphaFold, which, since 2021, has accurately predicted the structure of millions of proteins. To democratize access to AlphaFold predictions, Google DeepMind and EMBL-EBI developed the AlphaFold Database, an open resource that anyone can access. The database has over 3.4 million users from 190 countries.
Through ongoing dialogue with the scientific community, a clear need emerged to expand the AlphaFold Database to include protein complexes. In response to this need, EMBL-EBI, Google DeepMind, NVIDIA, and Seoul National University teamed up, contributing specialist expertise and resources, to calculate and integrate millions of protein complexes into the AlphaFold Database.
The collaboration brought together deep biological expertise and technical innovations. NVIDIA and the Steinegger Lab at Seoul National University developed the methodology, based on Google DeepMind's AI system AlphaFold, including accelerations to multiple sequence alignment calculations and deep learning inference.
NVIDIA provided cutting-edge AI infrastructure and scaled out inference pipelines to overcome limitations that historically made this scale of calculations challenging.
EMBL-EBI enabled the collaboration by bringing the other parties together and contributing expertise in scientific and biodata management, as well as analysis. As a champion of open science, EMBL-EBI, together with Google DeepMind, integrated the new dataset into the AlphaFold Database.
"NVIDIA's ambition is to consistently contribute orders-of-magnitude accelerations for fundamental digital biology workloads, enabling what was not possible before," said Anthony Costa, NVIDIA Director of Digital Biology. "This release is a great example of how AI infrastructure and software can uniquely enable new scales of biological understanding."
"By making predicted protein complexes accessible at an unprecedented scale, we are illuminating an unseen landscape of molecular interactions across the tree of life," explained Martin Steinegger, Associate Professor at Seoul National University.
Open science at scale
Generating AI predictions for protein complexes at this scale takes a blend of AI-scale infrastructure and deep technical knowledge of how to accelerate complex workflows. The collaboration is centrally hosting data that would otherwise require around 17 million hours of GPU (graphics processing unit) computing to recreate.
By making these calculations once and adding the information into the AlphaFold Database, this collaboration aims to help democratize access to protein complex predictions. It enables scientists everywhere to investigate how proteins interact in the vast protein universe, and accelerate discoveries that could lead to new medicines, new products, and a deeper understanding of life itself.
This is the first step in an ambition to add a wide range of protein complex structure predictions to the AlphaFold Database. The partnership has already calculated predictions for 30 million complexes.
Of these, 1.7 million high-confidence homodimer predictions have been added to the AlphaFold Database. Another 18 million are lower-confidence homodimers, which are available as a list and for bulk download. The remaining predictions, roughly 10 million, are heterodimers (complexes formed of two different proteins), which are currently being analyzed and assessed.
More protein complex predictions will be calculated and high-confidence predictions will be added to the AlphaFold Database in the coming months. The work is described in more detail in a preprint.
"The human genome has just over 20,000 different proteins. Despite this relatively small genome, human beings display incredibly complex pathways, processes and regulations," said Dame Janet Thornton, Director Emeritus of EMBL-EBI.
"Much of this complexity arises from the intermolecular interactions between proteins, and with small molecule ligands and DNA. Adding predicted protein-protein homodimeric interactions to the AlphaFold Database is a first step towards a comprehensive description of the human interactome, the basis by which human biology will be described and understood.
"This has relevance for the design of new therapeutics, understanding host-pathogen interactions, and more. Making these structures accessible to all allows every researcher around the world to build on these data, moving one step closer to predicting the biology of life."
Provided by European Molecular Biology Laboratory