By applying artificial intelligence models known as large language models, researchers have made significant progress in predicting the structure of proteins from their sequences. However, this approach has been less successful with antibodies, in part because of the high variability seen in this type of protein.
To overcome this limitation, MIT researchers developed a computational technique that allows large language models to predict antibody structures more accurately. Their work could allow researchers to screen millions of possible antibodies to identify ones that could be used to treat SARS-CoV-2 and other infectious diseases.
“Our method allows us to scale up to the point where we can actually find a few needles in a haystack, while others do not,” says Bonnie Berger, Simons Professor of Mathematics and head of the Computation and Biology group at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL), and one of the senior authors of the new study. “If we can help prevent drug companies from going into clinical trials with the wrong thing, it could really save them a lot of money.”
This technology, which focuses on modeling the hypervariable regions of antibodies, also has the potential to analyze an individual’s entire antibody repertoire. This could be useful for studying the immune response of people who are super-responders to diseases such as HIV and to find out why their antibodies are so effective at fending off the virus.
Bryan Bryson, an associate professor of biological engineering at MIT and a member of the Ragon Institute of MGH, MIT, and Harvard, is also a senior author of the paper, which appears this week in the Proceedings of the National Academy of Sciences. Rohit Singh, a former CSAIL research scientist who is now an assistant professor of biostatistics, bioinformatics, and cell biology at Duke University, and Chiho Im ’22 are the lead authors of the paper. Researchers from Sanofi and ETH Zurich also contributed to the study.
Hypervariability Modeling
Proteins are made up of long chains of amino acids, which can fold into an enormous number of possible structures. In recent years, predicting these structures has become much easier thanks to artificial intelligence programs such as AlphaFold. Many of these programs, including ESMFold and OmegaFold, are based on large language models, which were originally developed to analyze large amounts of text and learn to predict the next word in a sequence. The same approach can be applied to protein sequences, by learning which protein structures are most likely to form from different amino acid patterns.
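The “predict the next word” objective described above can be illustrated with a deliberately tiny toy model. This sketch is not a real protein language model — it learns only next-residue bigram frequencies from a handful of made-up sequences — but it shows the same training signal that large protein language models exploit at scale.

```python
from collections import defaultdict

# Toy illustration: learn which amino acid tends to follow which,
# from example sequences (the sequences here are invented).
def train_bigram(sequences):
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1
    return counts

def predict_next(counts, residue):
    """Return the residue most often observed to follow `residue`."""
    following = counts[residue]
    return max(following, key=following.get)

training = ["MKTAYIAK", "MKTLYIAK", "MKTAYIGK"]
model = train_bigram(training)
print(predict_next(model, "K"))  # prints "T": K is always followed by T here
```

Real models replace the bigram table with a transformer trained on hundreds of millions of sequences, but the objective — predicting the next token — is the same.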
However, this technique does not always work well for antibodies, especially for the antibody segments known as hypervariable regions. Antibodies typically have a Y-shaped structure; the hypervariable regions sit at the tips of the Y, where they detect and bind to foreign proteins known as antigens. The bottom part of the Y provides structural support and helps the antibody interact with immune cells.
Hypervariable regions vary in length but typically contain fewer than 40 amino acids. It is estimated that the human immune system can produce up to 100 quintillion different antibodies by changing the sequence of these amino acids, helping the body respond to a huge variety of potential antigens. These sequences are not evolutionarily constrained in the same way as other protein sequences, making it difficult for large language models to learn to predict their structures accurately.
“One of the reasons why language models can predict protein structures so well is because evolution constrains these sequences in a way that the model can decipher what those constraints mean,” says Singh. “This is similar to learning grammar rules by looking at the context of a word in a sentence and figuring out what it means.”
To model these hypervariable regions, the researchers created two modules that build on existing protein language models. One module was trained on the hypervariable sequences of about 3,000 antibody structures found in the Protein Data Bank (PDB), allowing it to learn which sequences tend to produce similar structures. The other module was trained on data correlating about 3,700 antibody sequences with how strongly they bind to three different antigens.
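The two-module design described above can be sketched schematically: a shared featurization of the hypervariable region feeds both a structure-similarity head and a binding-strength head. Everything below is hypothetical and stands in for the real system — the paper's modules are trained neural networks on PDB structures and binding measurements, while this sketch uses a trivial residue-count embedding purely to show the division of labor.

```python
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def embed_hypervariable(seq):
    # Stand-in featurization: normalized counts of each residue type.
    # (The real modules use learned language-model embeddings instead.)
    return [seq.count(a) / max(len(seq), 1) for a in ALPHABET]

def structure_similarity(seq_a, seq_b):
    # Module 1 (sketch): sequences with close embeddings are predicted
    # to fold into similar hypervariable-region structures.
    ea, eb = embed_hypervariable(seq_a), embed_hypervariable(seq_b)
    return 1.0 - sum(abs(x - y) for x, y in zip(ea, eb)) / 2.0

def binding_score(seq, weights):
    # Module 2 (sketch): a learned readout of the embedding that
    # estimates binding strength against one particular antigen.
    return sum(w * x for w, x in zip(weights, embed_hypervariable(seq)))
```

Identical sequences score a similarity of 1.0, and completely disjoint residue compositions score 0.0; the binding head is just a weighted sum here, in place of a trained regression.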
The resulting computational model, known as AbMap, can predict an antibody’s structure and binding strength from its amino acid sequence. To demonstrate the model’s usefulness, the researchers used it to predict antibody structures that potently neutralize the spike protein of the SARS-CoV-2 virus.
The researchers started with a set of antibodies predicted to bind to this target and then generated millions of variants by altering the hypervariable regions. Their model identified the antibody structures predicted to be most successful with much greater accuracy than traditional protein structure models based on large language models.
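The generate-and-rank loop described above — mutate the hypervariable region, score every variant, keep the predicted best binders — can be sketched as follows. The scoring function here is an invented placeholder (it simply favors aromatic residues) standing in for the model's binding-strength prediction, and the starting sequence is made up.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def point_mutants(hv_region):
    """Yield every single-residue variant of a hypervariable region."""
    for i, original in enumerate(hv_region):
        for substitute in AMINO_ACIDS:
            if substitute != original:
                yield hv_region[:i] + substitute + hv_region[i + 1:]

def rank_variants(hv_region, score_fn, top_k=5):
    """Score all variants and return the predicted best binders."""
    return sorted(point_mutants(hv_region), key=score_fn, reverse=True)[:top_k]

# Placeholder scoring function (NOT the paper's model): counts
# aromatic residues as a crude proxy for predicted binding strength.
score = lambda seq: sum(seq.count(a) for a in "FWY")

best = rank_variants("GYTRSA", score, top_k=3)
```

A length-6 region yields 6 × 19 = 114 single mutants; exhaustively scoring multi-site combinations is what pushes the search into the millions of variants mentioned above, and is why a fast learned scorer matters.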
The researchers then took an additional step, clustering the antibodies into groups with similar structures. Working with researchers at Sanofi, they selected antibodies from each cluster to test experimentally. The experiments showed that 82 percent of these antibodies bound their target better than the original antibodies used to seed the model.
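The cluster-then-sample step can be sketched with a simple greedy clustering: each antibody joins the first cluster whose representative is within a distance threshold, otherwise it starts a new cluster, and one candidate per cluster goes to the lab. The embeddings and antibody names below are invented for illustration; the paper clusters by predicted structure, and the specific clustering algorithm used there is not described here.

```python
def cluster_by_structure(embeddings, threshold=0.5):
    """Greedy clustering: join the first cluster whose representative
    vector is within `threshold` (L1 distance), else start a new one."""
    clusters = []  # list of (representative_vector, member_names)
    for name, vec in embeddings.items():
        for rep_vec, members in clusters:
            if sum(abs(a - b) for a, b in zip(vec, rep_vec)) < threshold:
                members.append(name)
                break
        else:
            clusters.append((vec, [name]))
    return [members for _, members in clusters]

# Hypothetical 2-D "structure embeddings" for five candidate antibodies:
embeddings = {
    "ab1": [0.0, 0.0], "ab2": [0.1, 0.1], "ab3": [1.0, 1.0],
    "ab4": [1.1, 0.9], "ab5": [3.0, 3.0],
}
clusters = cluster_by_structure(embeddings)
representatives = [members[0] for members in clusters]  # one per cluster
```

Picking one representative per structural cluster, rather than the top scores overall, is what keeps the experimental candidates diverse — the “don’t put all your eggs in one basket” strategy described below.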
Identifying a variety of good candidates early in the development process could help pharmaceutical companies avoid spending large sums of money testing candidates that end up failing later, the researchers say.
“They don’t want to put all their eggs in one basket,” says Singh. “They don’t want to say I’m going to take this one antibody and run it through preclinical testing and find it to be toxic. They would rather have a series of good possibilities and run through them all so they have some options if one goes wrong.”
Antibody comparison
Using this technique, researchers may also try to answer some long-standing questions about why people respond differently to infections. For example, why do some people get a much more severe form of COVID-19, and why do some people exposed to HIV never get infected at all?
Scientists have been trying to answer these questions by performing single-cell RNA sequencing of an individual’s immune cells and comparing them. This process is known as antibody repertoire analysis. Previous research has shown that the antibody repertoires of two different people may overlap by only about 10%. However, sequencing does not provide as comprehensive a picture of antibody performance as structural information. This is because two antibodies with different sequences can have similar structures and functions.
The new model could help solve the problem by quickly generating structures for all antibodies found in an individual. In the study, the researchers showed that when considering structure, there is much more overlap between individuals than the 10% seen in sequence comparisons. They now plan to further investigate how these structures may contribute to the body’s overall immune response to specific pathogens.
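The sequence-versus-structure overlap comparison above can be made concrete with a toy calculation: overlap between two repertoires is small when antibodies are compared as exact sequences, but larger when each sequence is first mapped to a predicted structural cluster, since different sequences can share a fold. The sequences, fold labels, and the mapping below are all invented for illustration.

```python
def jaccard(a, b):
    """Jaccard overlap between two collections, as sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

# Hypothetical hypervariable sequences from two people's repertoires:
person1_seqs = ["GYTFTSYA", "GYTFTDYA", "ARDRSWYF"]
person2_seqs = ["GYTFTSYG", "ARDRSWYF", "SSSGSYIY"]

# Invented mapping from sequence to predicted structural cluster:
structure_of = {
    "GYTFTSYA": "fold_A", "GYTFTDYA": "fold_A", "ARDRSWYF": "fold_B",
    "GYTFTSYG": "fold_A", "SSSGSYIY": "fold_C",
}

seq_overlap = jaccard(person1_seqs, person2_seqs)            # 1 shared of 5
struct_overlap = jaccard([structure_of[s] for s in person1_seqs],
                         [structure_of[s] for s in person2_seqs])
```

Here only one exact sequence is shared (overlap 0.2), but two of three structural clusters are shared (overlap about 0.67) — the same qualitative effect, on toy data, as the study's finding that structural overlap between individuals far exceeds the roughly 10 percent seen at the sequence level.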
“This is where language models fit so beautifully because they approach the accuracy of structure-based analysis while having the scalability of sequence-based analysis,” says Singh.
This research was funded by Sanofi and the Abdul Latif Jameel Clinic for Machine Learning in Health.