Whether describing the sound of a faulty car engine or meowing like your neighbor’s cat, imitating sounds with your voice can help convey a concept when words don’t work.
Vocal mimicry is the acoustic equivalent of doodling a quick picture to convey what you saw, except that instead of using a pencil to illustrate an image, you use your vocal tract to express a sound. It may seem difficult, but it’s something we all do intuitively. To experience it yourself, try using your voice to mirror the sound of an ambulance siren, a crow, or a bell.
Inspired by the cognitive science of how we communicate, researchers at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) have developed an AI system that can produce human-like vocal imitations with no training, and without ever having “heard” a human vocal impression before.
To achieve this, the researchers designed their system to produce and interpret sounds much like we do. They started by building a model of the human vocal tract that simulates how vibrations from the vocal cords are shaped by the throat, tongue, and lips. They then used a cognitively inspired AI algorithm to control this vocal tract model and generate imitations, taking into account the context-specific ways humans choose to communicate sound.
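The article does not include the researchers’ implementation, but the basic loop it describes, propose vocal-tract control parameters, synthesize a sound, and compare it with the target, can be sketched in miniature. In the hypothetical example below, `vocal_tract` is a toy source-filter synthesizer (a pulse train shaped by a single resonance), and `imitate` searches its three control parameters for the output whose spectrum best matches a target clip. The function names, parameters, and search strategy are illustrative stand-ins, not the CSAIL system.

```python
import numpy as np

SR = 16000  # sample rate (Hz)

def vocal_tract(pitch_hz, formant_hz, bandwidth_hz, dur=0.5):
    """Toy source-filter synthesizer: a glottal pulse train shaped by one resonance."""
    n = int(SR * dur)
    t = np.arange(n) / SR
    source = np.sign(np.sin(2 * np.pi * pitch_hz * t))  # crude glottal source
    # A single second-order resonator stands in for throat/tongue/lip shaping.
    r = np.exp(-np.pi * bandwidth_hz / SR)
    theta = 2 * np.pi * formant_hz / SR
    a1, a2 = -2 * r * np.cos(theta), r * r
    y = np.zeros(n)
    for i in range(n):
        y[i] = source[i] - a1 * y[i - 1] - a2 * y[i - 2]
    return y / np.max(np.abs(y))

def spectral_distance(a, b):
    """Squared distance between two equal-length clips' log-magnitude spectra."""
    A = np.log1p(np.abs(np.fft.rfft(a)))
    B = np.log1p(np.abs(np.fft.rfft(b)))
    return float(np.mean((A - B) ** 2))

def imitate(target, n_iters=100, seed=0):
    """Random search over control parameters for the closest-sounding imitation."""
    rng = np.random.default_rng(seed)
    best_params, best_cost = None, np.inf
    for _ in range(n_iters):
        params = (rng.uniform(80, 400),    # pitch (Hz)
                  rng.uniform(300, 3000),  # formant frequency (Hz)
                  rng.uniform(50, 400))    # formant bandwidth (Hz)
        cost = spectral_distance(vocal_tract(*params), target)
        if cost < best_cost:
            best_params, best_cost = params, cost
    return best_params, best_cost

# Example: "imitate" a synthetic siren-like tone.
target = np.sin(2 * np.pi * 700 * np.arange(int(SR * 0.5)) / SR)
print(imitate(target))
```

A real articulatory model has far more degrees of freedom and a smarter controller, but the “synthesize, compare, adjust” loop is the core idea the paragraph above describes.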
The model can take in many sounds from the world, such as rustling leaves, the hiss of a snake, or the siren of an approaching ambulance, and produce a human-like imitation of each. It can also be run backwards to guess real-world sounds from human vocal imitations, similar to how some computer vision systems can generate high-quality images from sketches. For instance, the model can correctly tell apart a human imitating a cat’s “meow” from one imitating its “hiss.”
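One rough, hypothetical way to picture the “backwards” direction is the sketch below: given a heard vocal imitation and a set of candidate real-world sounds, it asks which candidate the model itself would imitate most similarly. It assumes an `imitate_audio(sound)` helper that returns the model’s audio imitation of a given sound (for example, by synthesizing with the best parameters found by a search like the one sketched earlier); all names here are illustrative, not from the paper.

```python
import numpy as np

def log_spectrum(x):
    """Log-magnitude spectrum used to compare equal-length audio clips."""
    return np.log1p(np.abs(np.fft.rfft(x)))

def infer_source(heard_imitation, candidate_sounds, imitate_audio):
    """Guess which real-world sound a vocal imitation depicts: imitate each
    candidate with the model and pick the candidate whose imitation is
    closest to the one we heard."""
    scores = {}
    for name, sound in candidate_sounds.items():
        model_imitation = imitate_audio(sound)  # the model's own imitation of this candidate
        diff = log_spectrum(heard_imitation) - log_spectrum(model_imitation)
        scores[name] = float(np.mean(diff ** 2))
    return min(scores, key=scores.get), scores

# Hypothetical usage:
# best, scores = infer_source(recording, {"meow": meow_clip, "hiss": hiss_clip}, imitate_audio)
```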
In the future, this model could potentially lead to more intuitive “imitation-based” interfaces for sound designers, more human-like AI characters in virtual reality, and even a way to help students learn new languages.
The paper’s co-authors, MIT CSAIL doctoral students Kartik Chandra SM ’23 and Karima Ma and undergraduate researcher Matthew Caren, note that computer graphics researchers have long recognized that realism is not the ultimate goal of visual representation. An abstract painting or a child’s crayon doodle, for example, can be just as expressive as a photograph.
“Over the past few decades, advances in sketching algorithms have led to new tools for artists, advances in AI and computer vision, and even a deeper understanding of human cognition,” says Chandra. “Just as a sketch is an abstract, non-photorealistic representation of an image, our method captures the abstract, non-phono-realistic ways humans express the sounds they hear. This teaches us about the process of auditory abstraction.”
“The goal of this project was to understand and computationally model vocal imitation, which we consider to be the auditory analogue of sketching in the visual domain,” says Caren.
The Art of Imitation, in Three Parts
The team developed three increasingly nuanced versions of the model to compare to human vocal imitation. First, they created a baseline model that aimed to produce as close an imitation of real sounds as possible. However, this model did not match human behavior well.
The researchers then built a second, “communicative” model. According to Caren, this model considers which features of a sound are distinctive to a listener. For instance, you would likely imitate a motorboat by mimicking the rumble of its engine, since that is its most recognizable auditory feature, even though it is not the loudest part of the sound (compared to, say, the water splashing). This second model produced imitations that were better than the baseline’s, but the team wanted to improve it further.
To take the method a step further, the researchers added a final layer of reasoning to the model. “Vocal imitations can sound different depending on how much effort you put into them. It costs time and energy to produce sounds that are perfectly accurate,” says Chandra. The researchers’ full model accounts for this by trying to avoid utterances that are very rapid, loud, or very high- or low-pitched, which people are less likely to use in conversation. The result: more human-like imitations that closely match many of the decisions humans make when imitating the same sounds.
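As a rough sketch of how these three layers might combine, the hypothetical cost function below adds a baseline spectral-matching term, a communicative term that rewards imitations a listener would match to the target rather than to competing sounds, and an effort penalty on extreme pitch, loudness, and speed. The specific terms, names, and weights are illustrative assumptions, not the paper’s actual formulation.

```python
import numpy as np

def log_spectrum(x):
    return np.log1p(np.abs(np.fft.rfft(x)))

def reconstruction_cost(imitation, target):
    """Baseline model: how far the imitation's spectrum is from the real sound's.
    (Assumes equal-length clips.)"""
    return float(np.mean((log_spectrum(imitation) - log_spectrum(target)) ** 2))

def communicative_cost(imitation, target, distractors):
    """Communicative model: favor imitations that match the target better than
    they match competing sounds (e.g. engine rumble vs. water splashing)."""
    to_target = reconstruction_cost(imitation, target)
    to_nearest_other = min(reconstruction_cost(imitation, d) for d in distractors)
    return to_target - to_nearest_other  # lower when the imitation is distinctive

def effort_cost(pitch_hz, loudness, rate):
    """Effort term: penalize very high/low pitch, loud, or rapid utterances,
    which speakers tend to avoid in conversation."""
    return abs(np.log(pitch_hz / 200.0)) + loudness ** 2 + rate ** 2

def total_cost(imitation, controls, target, distractors, w_comm=1.0, w_effort=0.1):
    """Full model: choose the vocal-tract controls that minimize this combined cost."""
    pitch_hz, loudness, rate = controls
    return (reconstruction_cost(imitation, target)
            + w_comm * communicative_cost(imitation, target, distractors)
            + w_effort * effort_cost(pitch_hz, loudness, rate))
```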
After building this model, the team conducted behavioral experiments to see whether human judges rated the AI-generated or human-generated vocal imitations more favorably. Notably, 25% of participants preferred the AI model overall, 75% preferred it for imitations of motorboats, and 50% preferred it for imitations of gunshots.
Toward more expressive sound technology
Caren, who is passionate about technology for music and art, believes the model could help artists better communicate sounds to computing systems, and could help filmmakers and other content creators generate AI sounds that are more nuanced to a specific context. It could also let a musician rapidly search a sound database by imitating a noise that is difficult to describe in, say, a text prompt.
In the meantime, Caren, Chandra, and Ma are investigating the implications of their model in other domains, including the development of language, how infants learn to talk, and imitation behaviors in birds such as parrots and songbirds.
The team notes that there is still work to do on the current iteration of the model. It struggles with some consonants, such as “z,” which led to inaccurate impressions of certain sounds, like a buzzing bee. It also cannot yet replicate how humans imitate speech, music, or sounds that are imitated differently across languages, such as a heartbeat.
Robert Hawkins, a professor of linguistics at Stanford University, notes that language is full of onomatopoeia, words that mimic but do not fully replicate the things they describe, such as “meow,” which only loosely approximates the sound a cat makes. “The process that takes us from the sound of a real cat to a word like ‘meow’ reveals a lot about the intricate interplay between physiology, social reasoning, and communication in the evolution of language,” says Hawkins, who was not involved in the CSAIL research. “This model presents an exciting step toward formalizing and testing theories of those processes, showing that both physical constraints from the human vocal tract and social pressures from communication are needed to explain the distribution of vocal imitations.”
Caren, Chandra, and Ma wrote the paper with two other CSAIL affiliates: Jonathan Ragan-Kelley, MIT associate professor of electrical engineering and computer science, and Joshua Tenenbaum, MIT professor of brain and cognitive sciences and a member of the Center for Brains, Minds, and Machines. Their work was supported in part by the Hertz Foundation and the National Science Foundation, and was presented at SIGGRAPH Asia in early December.