Despite their impressive capabilities, large language models are not perfect. These artificial intelligence models sometimes produce inaccurate or unsupported information in response to a query, a problem known as “hallucination.”
Because of this hallucination problem, an LLM’s responses are often verified by human fact-checkers, especially if the model is deployed in a high-stakes setting such as health care or finance. But validation typically requires people to read through the long documents cited by the model, a task so onerous and error-prone that it may deter some users from deploying generative AI models in the first place.
To help human validators, MIT researchers created a user-friendly system that enables people to verify an LLM’s responses much more quickly. With this tool, called SymGen, an LLM generates responses with citations that point directly to the place in a source document, such as a given cell in a database.
Users can hover over highlighted portions of the text response to see the data the model used to generate a specific word or phrase. At the same time, the unhighlighted portions show users which phrases need additional attention to check and verify.
“We give people the ability to selectively focus on the parts of the text they should be more concerned about. In the end, SymGen can give people higher confidence in a model’s responses because they can easily take a closer look to ensure the information is verified,” says Shannon Shen, a graduate student in electrical engineering and computer science and co-lead author of a paper on SymGen.
Through user studies, Shen and his colleagues found that SymGen reduced verification time by about 20 percent compared to manual procedures. By making it faster and easier for humans to verify model outputs, SymGen could help people catch errors in LLMs deployed in a variety of real-world situations, from generating clinical notes to summarizing financial market reports.
Shen is joined on the paper by co-lead author and fellow EECS graduate student Lucas Torroba Hennigen; EECS graduate student Aniruddha “Ani” Nrusimha; Bernhard Gapp, chairman of the Good Data Initiative; and senior authors David Sontag, an EECS professor, MIT Jameel Clinic member, and leader of the Clinical Machine Learning Group at the Computer Science and Artificial Intelligence Laboratory (CSAIL), and Yoon Kim, an EECS assistant professor and CSAIL member. The research was recently presented at the Conference on Language Modeling.
Symbolic references
To aid verification, many LLMs are designed to generate citations that point to external documents alongside their language-based responses, so users can check them. But these verification systems are usually designed as an afterthought, without considering the effort it takes people to sift through numerous citations, Shen says.
“Generative AI is intended to reduce the user’s time to complete a task. If you need to spend hours reading through all these documents to verify that the model is saying something reasonable, it becomes less helpful to actually deploy generations in practice,” Shen says.
The researchers approached the validation problem from the perspective of the humans who would perform the task.
SymGen users first provide the LLM with data it can reference in its response, such as a table containing statistics from a basketball game. Then, rather than immediately asking the model to complete a task, like generating a game summary from those data, the researchers take an intermediate step: They prompt the model to generate its response in a symbolic form.
With this prompt, every time the model wants to cite words in its response, it must write the specific cell from the data table that contains the information it is referencing. For instance, if the model wants to cite the phrase “Portland Trailblazers” in its response, it replaces that text with the name of the cell in the data table that contains those words.
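To make this concrete, here is a minimal sketch in Python of what such a symbolic response might look like for the basketball example. The table, the cell names, and the double-brace placeholder syntax are illustrative assumptions, not the researchers’ actual prompt format.

```python
# Hypothetical source data: one cell per statistic (the table and cell names
# are made up for illustration).
game_table = {
    "team_home": "Portland Trailblazers",
    "team_away": "Boston Celtics",
    "score_home": "105",
    "score_away": "98",
}

# What the model is prompted to emit: placeholders that name table cells
# instead of the cited words themselves (an assumed {{cell}} syntax).
symbolic_response = (
    "The {{team_home}} beat the {{team_away}} {{score_home}}-{{score_away}}."
)
```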
“Because we have this intermediate step that puts the text in a symbolic format, we can have really fine-grained references. For every span of text in the output, we can say exactly where in the data it comes from,” says Torroba Hennigen.
SymGen then resolves each reference using a rule-based tool that copies the corresponding text from the data table into the model’s response.
“This way we know that it is a verbatim copy, so there will be no errors in the parts of the text that correspond to the actual data variables,” adds Shen.
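Below is a minimal sketch of that resolution step, assuming the same hypothetical double-brace placeholder convention as above. It simply copies each referenced cell’s value verbatim into the response, which is why those spans cannot diverge from the source data.

```python
import re

# A minimal, hypothetical sketch of a rule-based resolution step.
# The {{cell_name}} placeholder convention and cell names are illustrative
# assumptions, not SymGen's actual syntax.
game_table = {
    "team_home": "Portland Trailblazers",
    "team_away": "Boston Celtics",
    "score_home": "105",
    "score_away": "98",
}

symbolic_response = (
    "The {{team_home}} beat the {{team_away}} {{score_home}}-{{score_away}}."
)

def resolve(symbolic_text: str, table: dict) -> str:
    """Copy each referenced cell's value verbatim into the response, so the
    cited spans are guaranteed to match the source data exactly."""
    def substitute(match: re.Match) -> str:
        cell_name = match.group(1)
        return table[cell_name]  # verbatim copy; KeyError if the cell does not exist
    return re.sub(r"\{\{(\w+)\}\}", substitute, symbolic_text)

print(resolve(symbolic_response, game_table))
# "The Portland Trailblazers beat the Boston Celtics 105-98."
```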
Simplifying verification
The model can produce symbolic responses due to the way it was trained. Large-scale language models are fed large amounts of data from the Internet, some of which is written in a “placeholder format” where code replaces the actual values.
A similar structure is used when SymGen prompts the model to generate a symbolic response.
“We design our prompts in a specific way to leverage the capabilities of the LLM,” Shen adds.
In a user study, the majority of participants said SymGen made it easier to verify LLM-generated text. On average, they could validate the model’s responses about 20 percent faster than with standard methods.
However, SymGen is limited by the quality of the source data. LLMs may cite incorrect variables and human verifiers may be none the wiser.
Users must also have source data in a structured format, such as tables, to provide to SymGen. Currently, the system only works with tabular data.
Moving forward, the researchers are enhancing SymGen so it can handle arbitrary text and other forms of data. With that capability, it could help verify portions of AI-generated legal document summaries, for instance. They also plan to test SymGen with physicians to study how it identifies errors in AI-generated clinical summaries.
This work is funded in part by Liberty Mutual and the MIT Quest for Intelligence Initiative.