One day, you might want a household robot to carry your dirty clothes downstairs and place them in the washing machine in the far left corner of your basement. The robot must combine visual observations with the user’s instructions to determine what steps to take to complete this task.
For an AI agent, this is easier said than done. Current approaches often rely on multiple hand-crafted machine-learning models to handle different parts of the task, which take enormous human effort and expertise to build. Methods that use visual representations to make navigation decisions directly demand huge amounts of visual data for training, which are often difficult to obtain.
To overcome these challenges, researchers at MIT and the MIT-IBM Watson AI Lab have devised a navigation method that converts visual representations into language fragments and then feeds them into one large language model that performs all parts of the multi-step navigation task.
Instead of encoding visual features from images of the robot's surroundings into a computationally intensive visual representation, their method generates text captions that describe the robot's point of view. A large language model then uses the captions to predict the actions the robot should take to fulfill the user's language-based instructions.
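The snippet below is a minimal sketch of that caption-then-prompt idea, not the team's code: it assumes an off-the-shelf captioning model from the Hugging Face transformers library, and query_llm is a hypothetical placeholder for whichever large language model is used.

```python
# Minimal sketch of the caption-then-prompt setup (not the authors' implementation).
# Assumes the Hugging Face transformers library; query_llm is a hypothetical
# placeholder for any instruction-following large language model.
from transformers import pipeline

captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")

def query_llm(prompt):
    """Placeholder: send the prompt to an LLM of your choice and return its reply."""
    raise NotImplementedError

def choose_next_action(image_path, instruction):
    # Turn the robot's current view into a text caption instead of a visual feature.
    caption = captioner(image_path)[0]["generated_text"]
    # Ask the language model what to do next, given only text.
    prompt = (
        f"Instruction: {instruction}\n"
        f"Current view: {caption}\n"
        "What should the robot do next?"
    )
    return query_llm(prompt)
```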
Because their method leverages purely language-based representations, it can efficiently generate huge amounts of synthetic training data using large-scale language models.
Although this approach does not outperform techniques that use visual features, it works well in situations where visual training data is scarce. The researchers also found that combining language-based inputs with visual signals leads to better navigation performance.
“By purely using language as a perceptual representation, our approach becomes simpler: any input can be encoded in language, producing trajectories that are understandable to humans,” says Bowen Pan, a graduate student in electrical engineering and computer science (EECS) and lead author of a paper on this approach.
Pan’s co-authors include his advisor, Aude Oliva, director of strategic industry engagement at the MIT Schwarzman College of Computing, MIT director of the MIT-IBM Watson AI Lab, and a principal investigator at the Computer Science and Artificial Intelligence Laboratory (CSAIL); Phillip Isola, an EECS associate professor and CSAIL member; senior author Yoon Kim, an EECS assistant professor and CSAIL member; and others at the MIT-IBM Watson AI Lab and Dartmouth College. The research will be presented at the Conference of the North American Chapter of the Association for Computational Linguistics.
Solving vision problems with language
Pan says that because large language models are the most capable machine-learning models available, the researchers sought to incorporate them into the complex task of vision-and-language navigation.
However, these models use text-based input and cannot handle visual data from robot cameras. So the team had to find a way to use language instead.
Their technique leverages a simple caption model to obtain textual descriptions of the robot’s visual observations. These captions are combined with language-based instructions and fed into a large-scale language model that determines what navigation steps the robot should take next.
The large language model then outputs a caption of the scene the robot should see after completing that step, which is used to update the trajectory history so the robot can keep track of where it has been.
The model repeats this process to generate a trajectory that guides the robot to its goal one step at a time.
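That step-by-step loop can be pictured with a rough sketch. The code below is illustrative only; get_current_view, caption_view, and query_llm are hypothetical placeholder functions for the camera, the captioner, and the language model, and none of these names come from the paper.

```python
# Rough sketch of the iterative decision loop described above (illustrative only;
# get_current_view, caption_view, and query_llm are hypothetical placeholders).

def navigate(instruction, get_current_view, caption_view, query_llm, max_steps=20):
    history = []  # text-only trajectory history: captions of views and actions taken
    for _ in range(max_steps):
        # Describe the current observation in plain language.
        caption = caption_view(get_current_view())
        history.append(f"View: {caption}")
        prompt = (
            f"Instruction: {instruction}\n"
            "Trajectory so far:\n" + "\n".join(history) + "\n"
            "Next action (e.g. 'turn left 30 degrees', 'go forward', 'stop'):"
        )
        action = query_llm(prompt).strip()
        history.append(f"Action: {action}")
        if action == "stop":  # the model judges that the goal has been reached
            break
    return history
```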
To streamline the process, the researchers designed templates so that observation information is presented to the model in a standard form, as a series of choices the robot can make based on its surroundings.
For example, a caption might say, “At 30 degrees to your left is a door with a flower pot next to it, and behind you is a small office with a desk and a computer.” The model then chooses whether the robot should move toward the door or toward the office.
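A template of that kind might look roughly like the following hypothetical sketch; the option wording and format are illustrative, not the paper's exact prompt.

```python
# Hypothetical example of a view template: each candidate direction the robot can
# face becomes a named option the language model chooses from (format is illustrative).

def format_options(view_captions):
    """view_captions maps a heading in degrees (relative to the robot) to a caption."""
    lines = []
    for i, (heading, caption) in enumerate(sorted(view_captions.items())):
        lines.append(f"Option {i}: at {heading} degrees, {caption}")
    return "\n".join(lines)

print(format_options({
    -30: "there is a door with a flower pot next to it",
    180: "a small office with a desk and a computer",
}))
# Option 0: at -30 degrees, there is a door with a flower pot next to it
# Option 1: at 180 degrees, a small office with a desk and a computer
```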
“One of the biggest challenges was finding a way to encode this kind of information into language in an appropriate way so that the agent could understand what the task was and how to respond,” Pan said.
Advantages of Language
When they tested this approach, the researchers found that while it could not outperform vision-based techniques, it offered several advantages.
First, because text requires fewer computational resources to synthesize than complex image data, the method can be used to quickly generate synthetic training data. In one test, they generated 10,000 synthetic trajectories based on 10 real visual trajectories.
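A sketch of how that language-only data generation might work is shown below; query_llm is again a hypothetical placeholder for any capable instruction-following language model, and the prompt wording is illustrative rather than the authors' procedure.

```python
# Sketch of language-only synthetic data generation (illustrative; query_llm is a
# hypothetical placeholder for any capable instruction-following language model).

def make_synthetic_trajectory(seed_trajectories, query_llm):
    # Show the model a few real trajectories written as text and ask for a new one
    # in the same format; because everything is text, no rendering engine or robot
    # is needed, which is what makes large-scale generation cheap.
    prompt = (
        "Here are example navigation trajectories written as text:\n\n"
        + "\n\n".join(seed_trajectories)
        + "\n\nWrite one new, plausible trajectory in the same format, with a new "
        "instruction, step-by-step scene captions, and actions."
    )
    return query_llm(prompt)

# e.g. expanding a handful of real trajectories into a large synthetic set:
# synthetic = [make_synthetic_trajectory(real_trajectories, query_llm) for _ in range(10_000)]
```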
This technique can also bridge the gap that can prevent agents trained in simulated environments from performing well in the real world. That gap often arises because computer-generated images can look quite different from real scenes due to factors such as lighting or color. But language describing a synthetic image and a real image would be much harder to tell apart, Pan says.
Additionally, the representations the model uses are easier for humans to understand because they are written in natural language.
“If an agent fails to achieve its goal, it is easier to see where it failed and why it failed. Maybe the historical information wasn’t clear enough, or maybe we ignored some important details in our observations,” says Pan.
Additionally, because their method uses only one type of input, it can be more easily applied to a variety of tasks and environments. As long as the data can be encoded in a language, the same model can be used without any modification.
However, one drawback is that their method naturally loses some information that can be captured in vision-based models, such as depth information.
At the same time, the researchers were surprised to find that combining language-based representations with vision-based methods improves an agent’s ability to navigate.
“Maybe this means that language can capture some higher-level information that cannot be captured with pure vision features,” he says.
This is one area the researchers want to continue exploring. They also hope to develop a navigation-oriented captioner that could boost the method’s performance, and to probe the capacity of large language models for spatial awareness to see how it could aid language-based navigation.
This research is funded in part by the MIT-IBM Watson AI Lab.