Going Beyond Words: MIT AI Learns to Imitate Sounds
Training Artificial Intelligence for Human-Like Conversation
If you've ever been in a situation where words fall short, you know the power of imitating sounds to convey a concept. Just as you might scribble a quick sketch to illustrate something you've seen, you can use your voice to communicate in sonic form. Sounds like an ambulance siren, a crow's call, or a bell's toll can all be replicated with the vocal cords, something we all do instinctively.
Intrigued by the cognitive science behind human communication, researchers at the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL) have crafted an AI system capable of generating human-like vocal imitations without extensive training or prior exposure to human vocal impressions.
To achieve this feat, the researchers built a model of the human vocal tract that mimics the way vibrations from the voice box are shaped by the throat, tongue, and lips. They then leveraged a cognitively-inspired AI algorithm to control this vocal tract model, allowing it to produce imitations based on context-specific human communication methods.
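The source-filter idea described above, where vibrations from the voice box are shaped by resonances of the throat, tongue, and lips, can be illustrated with a minimal synthesis sketch. This is not the authors' actual model; the function names, formant values, and the simple second-order resonators below are illustrative assumptions only.

```python
import numpy as np

SR = 16000  # sample rate in Hz (assumed for this sketch)

def glottal_source(f0, duration, sr=SR):
    """Sawtooth-like pulse train standing in for voice-box vibration."""
    t = np.arange(int(duration * sr)) / sr
    phase = (t * f0) % 1.0
    return 2.0 * phase - 1.0

def formant_filter(signal, freq, bandwidth, sr=SR):
    """Second-order resonator shaping the source, loosely playing the
    role of the throat/tongue/lip geometry."""
    r = np.exp(-np.pi * bandwidth / sr)          # pole radius from bandwidth
    theta = 2 * np.pi * freq / sr                # pole angle from formant freq
    a1, a2 = -2 * r * np.cos(theta), r * r       # recursive coefficients
    b0 = 1 - r                                   # rough gain normalization
    out = np.zeros_like(signal, dtype=float)
    for n in range(len(signal)):
        out[n] = b0 * signal[n]
        if n >= 1:
            out[n] -= a1 * out[n - 1]
        if n >= 2:
            out[n] -= a2 * out[n - 2]
    return out

def synthesize_vowel(f0=120.0, formants=((700, 80), (1200, 90)),
                     duration=0.5, sr=SR):
    """Chain the source through each formant resonator and normalize."""
    out = glottal_source(f0, duration, sr)
    for freq, bw in formants:
        out = formant_filter(out, freq, bw, sr)
    return out / np.max(np.abs(out))  # scale into [-1, 1]
```

Changing the formant frequencies reshapes the same source into different vowel-like timbres, which is the basic degree of freedom a controller over such a model would exploit.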
This model can replicate a variety of environmental sounds, such as rustling leaves, a snake's hiss, or an approaching ambulance siren. The model can also run in reverse, identifying real-world sounds from human imitations, much like how some computer vision systems can generate high-quality images from sketches. For instance, it can tell the difference between a human imitating a cat's "meow" and its "hiss."
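The reverse direction, matching an imitation to the real-world sound it evokes, can be caricatured as nearest-neighbor search over audio features. The coarse log-spectrum features and the candidate-matching loop below are a simplifying assumption for illustration, not the method the CSAIL team used.

```python
import numpy as np

def spectral_features(signal, n_bands=16):
    """Coarse summary of a sound: mean log-magnitude in n_bands
    frequency bands of its spectrum."""
    spectrum = np.abs(np.fft.rfft(signal))
    bands = np.array_split(spectrum, n_bands)
    return np.log1p(np.array([band.mean() for band in bands]))

def identify(imitation, candidates):
    """Return the label of the candidate sound whose features lie
    closest to the imitation's features (Euclidean distance)."""
    query = spectral_features(imitation)
    best_label, best_dist = None, np.inf
    for label, sound in candidates.items():
        dist = np.linalg.norm(query - spectral_features(sound))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label
```

With a tonal "meow"-like signal and a noisy "hiss"-like signal as candidates, a tonal imitation lands nearer the former, mirroring the meow-versus-hiss example in the article.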
The potential applications for this technology are vast, with possibilities ranging from more intuitive sound interfaces for designers to more lifelike AI characters in virtual reality. It might even aid language learners in the future.
The co-lead authors of this project - MIT CSAIL PhD candidates Kartik Chandra (SM '23) and Karima Ma, and undergraduate researcher Matthew Caren - highlight that the pursuit of realism in sound isn't always the primary goal. Much like an abstract painting or a child's crayon doodle, an imitation can be as expressive as a photographic representation.
"The advancement of sketching algorithms over the past few decades has given rise to new creative tools, AI and computer vision developments, and a deeper understanding of human cognition," observes Chandra. "In the same vein, our research investigates the abstract, non-phonetic ways humans express sounds they hear, delving into the process of auditory abstraction."
Trivial Tidbit: In a nod to the non-photorealistic nature of sound imitation, the model has demonstrated the ability to replicate the distinctive meow and hiss of the orange-and-white tabby cat named Schrödinger who resides at the MIT Media Lab. Schrödinger, known for his relaxed demeanor and striking looks, might be just as puzzled as the cat in Erwin Schrödinger's famous thought experiment about a cat placed in a box.
- The MIT Computer Science and Artificial Intelligence Laboratory (CSAIL) has developed an AI system capable of imitating various sounds, which could be a valuable tool for language learners in the future.
- The research on this AI system, co-led by MIT CSAIL PhD candidates Kartik Chandra and Karima Ma and undergraduate researcher Matthew Caren, aims to explore the abstract, non-phonetic ways humans express sounds they hear, a process known as auditory abstraction.
- This AI system has the ability to replicate not only common environmental sounds but also specific sounds made by animals, such as the meow of an orange-and-white tabby cat named Schrödinger, who resides at the MIT Media Lab.
- The potential applications for this technology extend beyond language learning, with possibilities including more intuitive sound interfaces for designers and more lifelike AI characters in virtual reality.