Zoom calls, meetings in the metaverse and virtual events could all be improved in the future thanks to a series of AI models developed by engineers at Meta, which the company says match sound to imagery, mimicking the way humans experience sound in the real world.
The three models, developed in partnership with researchers from the University of Texas at Austin, are known as Visual-Acoustic Matching, Visually-Informed Dereverberation and VisualVoice. Meta has made the models available for developers.
“We need AI models that understand a person’s physical surroundings based on both how they look and how things sound,” the company said in a blog post explaining the new models.
“For example, there’s a big difference between how a concert would sound in a large venue versus in your living room. That’s because the geometry of a physical space, the materials and surfaces in the area, and the proximity of where the sounds are coming from all factor into how we hear audio.”
Meta’s new audio AI models
The Visual-Acoustic Matching model can take an audio clip recorded anywhere, along with an image of a room or other space, and transform the clip to make it sound like it was recorded in that room.
An example use case for this could be to ensure that people in a video chat all experience sound the same way. So if one person is at home, another in a coffee shop and a third in an office, the sound could be adapted so that what you hear comes across as if it were in the room you are sitting in.
Visually-Informed Dereverberation is a model that does the opposite: it takes sounds and visual cues from a space, then focuses on removing the reverberation that space adds. For example, it can focus on the music from a violin even if it is recorded inside a large train station.
Finally, the VisualVoice model uses visual and audio cues to split speech from other background sounds and voices, allowing the listener to focus on a specific conversation. This could be used in a large conference hall with lots of people mingling.
This focused audio technique could also be used to generate better-quality subtitles or to make it easier for future machine learning models to understand speech when more than one person is talking, Meta explained.
Improving audio in virtual experiences
Rob Goodman, reader in music at the University of Hertfordshire and an expert in acoustic spaces, told Tech Monitor that this work feeds into a human need to understand where we are in the world and brings it to virtual settings.
“We have to think about how humans perceive sound in their environment,” Goodman says. “Human beings want to know where sound is coming from, how big a space is and how small a space is. When listening to sound being created we listen to several different things. One is the source, but you also listen to what happens to sound when combined with the room – the acoustics.”
Being able to capture and mimic that second aspect correctly could make virtual worlds and spaces seem more realistic, he explains, and do away with the disconnect humans might experience if the visuals don’t accurately match the audio.
An example of this could be a concert where a choir is performing outdoors, but the actual audio is recorded inside a cathedral, complete with significant reverb. That reverb wouldn’t be expected on a beach, so the mismatch of sound and visual would be unexpected and off-putting.
Goodman said the biggest change is how the perception of the listener is considered when implementing these AI models. “The position of the listener needs to be thought out a great deal,” he says. “The sound made close to a person compared to metres away is important. It is based around the speed of sound in air so a small delay in the time it takes to get to a person is utterly crucial.”
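Goodman's point about distance and delay can be illustrated with simple arithmetic. The sketch below is not part of Meta's models; it only assumes that sound travels through air at roughly 343 metres per second (a commonly quoted figure for dry air at about 20°C), and the function name is chosen here purely for illustration.

```python
# Assumption: speed of sound in dry air at about 20 degrees C.
SPEED_OF_SOUND_M_PER_S = 343.0


def propagation_delay_ms(distance_m: float) -> float:
    """Return the time, in milliseconds, for sound to travel distance_m through air."""
    if distance_m < 0:
        raise ValueError("distance must be non-negative")
    return distance_m / SPEED_OF_SOUND_M_PER_S * 1000.0


if __name__ == "__main__":
    # Compare a source close to the listener with ones several metres away.
    for distance in (0.5, 2.0, 10.0):
        print(f"{distance:>5.1f} m -> {propagation_delay_ms(distance):.1f} ms")
```

Under that assumption, a voice half a metre away arrives after about 1.5 milliseconds, while one ten metres away takes about 29 milliseconds. That gap is one of the cues a listener uses to judge how far away a sound is.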
He said part of the problem with improving audio is the lack of end-user equipment, explaining that users will “spend thousands of pounds on [a] curved monitor but won’t pay more than £20 for a pair of headphones”.
Professor Mark Plumbley, EPSRC Fellow in AI for Sound at the University of Surrey, is developing classifiers for different types of sounds so they can be removed or highlighted in recordings. “If you are going to create this realistic experience for people you need the vision and sound to match,” he says.
“It is harder for a computer than I think it would be for people. When we are listening to sounds there is an effect called directional marking that helps us focus on the sound from somebody in front of us and ignore sounds from the side.”
This is something we’re used to doing in the real world, Plumbley says. “If you are in a cocktail party, with lots of conversations going on, you can focus on the conversation of interest, we can block out sounds from the side or elsewhere,” he says. “This is a challenging thing to do in a virtual world.”
He says a lot of this work has come about because of changes in machine learning, with better deep learning techniques that work across different disciplines, including sound and image AI. “A lot of these things are related to signal processing,” Plumbley adds.
“Whether sounds, gravitational waves or time series information from financial data. They are about signals that come over time. In the past researchers had to build individual ways for different types of objects to extract out different things. Now we are finding deep learning models are able to pull out the patterns.”