Have you ever been asked a question that you only know part of the answer to? To get a more informed answer, the best thing to do is call a friend who knows more about the topic.
This collaborative process can also help improve the accuracy of large language models (LLMs). Still, teaching LLMs to recognize when they should collaborate with another model on an answer has been difficult. Rather than using complex formulas or vast amounts of labeled data to spell out where models should work together, researchers at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) envisioned a more organic approach.
Their new algorithm, called “Co-LLM,” pairs a general-purpose base LLM with a more specialized model and helps them work together. As the former drafts an answer, Co-LLM examines each word (or token) in the response to see where it can call on a more accurate answer from the expert model. This process leads to more accurate answers for things like medical prompts and math and reasoning problems. It also makes response generation more efficient, because the expert model is not needed at every step.
To determine when the base model needs help from the expert model, the framework uses machine learning to train a “switch variable,” a tool that indicates the competence of each word (token) in the two LLMs’ responses. The switch is like a project manager, identifying the areas where the expert should be called in. For example, if you ask Co-LLM to name some examples of extinct bear species, the two models draft an answer together. The general-purpose LLM starts writing the reply, and the switch variable steps in at the spots where the expert model can supply better tokens, such as adding the year a bear species became extinct.
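In code, this idea can be pictured as a per-token routing loop. The sketch below is illustrative only: the model objects, their next_token interface, the switch’s predict method, and the 0.5 threshold are assumptions made for exposition, not details from the paper.

```python
# Minimal sketch of token-level deferral in the spirit of Co-LLM.
# The interfaces (next_token, predict), the stop token, and the threshold
# are illustrative assumptions, not the authors' implementation.

def generate_with_deferral(prompt, base_model, expert_model, switch,
                           threshold=0.5, max_tokens=256):
    """Generate a reply token by token, deferring to the expert model
    whenever the learned switch predicts the base model will struggle."""
    tokens = []
    for _ in range(max_tokens):
        context = prompt + "".join(tokens)
        # The switch scores how much the next token needs the expert's help.
        if switch.predict(context) > threshold:
            next_token = expert_model.next_token(context)  # call in the expert
        else:
            next_token = base_model.next_token(context)    # base model continues
        if next_token == "</s>":  # stop at an assumed end-of-sequence marker
            break
        tokens.append(next_token)
    return "".join(tokens)
```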
“With Co-LLM, we essentially train a general-purpose LLM to ‘call in’ an expert model when needed,” says Shannon Shen, an MIT PhD student in electrical engineering and computer science, a CSAIL affiliate, and lead author of a new paper on the approach. “We use domain-specific data to teach the base model about its counterpart’s expertise in areas such as biomedical tasks and math and reasoning problems. This process automatically finds the parts of the data that the base model has difficulty generating, and then instructs the base model to call in an expert LLM that has been pretrained on data from a similar domain. The general-purpose model provides the ‘scaffolding’ of the generation, and when it calls on the expert LLM, it prompts the expert to generate the desired tokens. Our results show that the LLM organically learns collaboration patterns, similar to how humans recognize when to call in an expert to fill in a gap.”
A flexible, general recipe
Imagine asking a general-purpose LLM to name the ingredients of a particular prescription drug. It may answer incorrectly, requiring the expertise of a specialized model.
To demonstrate Co-LLM’s flexibility, the researchers used data such as the BioASQ medical set to pair the base LLM with expert LLMs in different domains, such as the Meditron model, which is pretrained on unlabeled medical data. This enabled the algorithm to answer questions a biomedical expert would typically receive, such as naming the mechanisms that cause a particular disease.
For example, asking a simple LLM alone to name the ingredients of a specific prescription drug may yield a wrong answer. With the added expertise of a model that specializes in biomedical data, you get a more accurate answer. Co-LLM also flags where users should double-check answers.
Another example of Co-LLM’s improved performance: when given a math problem such as “a³ · a² if a = 5,” the general-purpose model incorrectly calculated the answer as 125. Trained by Co-LLM to collaborate with Llemma, a large mathematics LLM, the two models together determined that the correct answer was 3,125.
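For reference, the exponent rule behind that example makes the two answers easy to check (a worked calculation, not taken from the paper):

```latex
a^{3} \cdot a^{2} = a^{3+2} = a^{5}, \qquad 5^{5} = 3125,
\qquad \text{whereas } 5^{3} = 125 \text{ is the base model's mistaken answer.}
```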
Co-LLM produced more accurate answers than either the fine-tuned simple LLM or the untuned specialized model working on its own. While Co-LLM can guide two differently trained models to work together, other effective LLM collaboration approaches, such as “proxy tuning,” require all constituent models to be trained similarly. In addition, that baseline requires every model to run simultaneously to produce an answer, whereas MIT’s algorithm generates more efficiently by activating the expert model only for specific tokens.
When to ask an expert
The MIT researchers’ algorithm highlights that more closely mimicking human teamwork could improve the accuracy of multi-LLM collaboration. To further raise factual accuracy, the team may draw on human self-correction: they are considering a more robust deferral approach that can backtrack when the expert model fails to provide a correct response. This upgrade would let Co-LLM correct course so that the algorithm can still provide a satisfactory answer.
The team also wants to keep answers as up-to-date as possible by updating the expert model (while training only the base model) as new information becomes available. This would let Co-LLM combine the most current information with strong reasoning capabilities. Eventually, the model could assist with enterprise documents, using the latest information it has to update them accordingly. Co-LLM could also train small, private models to work with a more powerful LLM to improve documents that must remain within the server.
“Co-LLM provides an interesting approach to learning how to choose between two models to improve efficiency and performance,” says Colin Raffel, an associate professor at the University of Toronto and an associate research director at the Vector Institute. “Because routing decisions are made at the token level, Co-LLM provides a granular way to defer difficult generation steps to a more powerful model. This unique combination of model- and token-level routing provides a great deal of flexibility that similar methods lack. Co-LLM contributes to an important line of work that aims to develop ecosystems of specialized models that can outperform expensive monolithic AI systems.”
Shen wrote the paper with four other CSAIL affiliates: PhD student Hunter Lang ’17, MEng ’18; former postdoc and Apple AI/ML researcher Bailin Wang; MIT Assistant Professor of Electrical Engineering and Computer Science Yoon Kim; and Professor and Jameel Clinic member David Sontag, PhD ’10, all of whom are affiliated with the MIT-IBM Watson AI Lab. Their work was supported in part by the National Science Foundation, the National Defense Science and Engineering Graduate (NDSEG) Fellowship, the MIT-IBM Watson AI Lab, and Amazon. Their work was presented at the annual meeting of the Association for Computational Linguistics.