Speech Processing

The research goal for speech at Google aligns with our company mission: to organize the world’s information and make it universally accessible and useful. Our pioneering research work in speech processing has enabled us to build automatic speech recognition (ASR) and text-to-speech (TTS) systems that are used across Google products, with support for more than a hundred language varieties spoken across the globe. From Gboard dictation to transcriptions of voice notes, from YouTube captions to team meetings without language barriers, and from Google Maps speaking directions aloud to Google Assistant reading the news, Google’s speech research has unparalleled reach and impact. We aim to solve speech for everyone, everywhere – and work to further improve quality, speed and versatility across all kinds of speech. We're also committed to expanding our language coverage, and have set a moonshot goal to build speech technologies for 1,000 languages.

Google's speech research efforts push the state-of-the-art on architectures and algorithms used across areas like speech recognition, text-to-speech synthesis, keyword spotting, speaker recognition, and language identification. The systems we build are deployed on servers in Google’s data centers but also increasingly on-device. The team has a passion for research that leads to product advances for the billions of users that use speech in Google products today. We also release academic publications and open-source projects for the broader research community to leverage.

Our speech technologies are embedded in products like the Assistant, Search, Gboard, Translate, Maps, YouTube, Cloud, and many more. Thanks to close collaborations with product teams, we are in a unique position to deliver user-centric research. Our researchers can conduct live experiments to test and benchmark new algorithms directly in a realistic controlled environment. Whether these are algorithmic improvements or user experience and human-computer interaction studies, we focus on solving real problems with real impact on users.

We value our user diversity, and have made it a priority to deliver the best performance to every language and language variety. Today, our speech systems operate in more than 130 language varieties, and we continue to expand our reach. The challenges of internationalizing at scale are immense and rewarding. We are breaking new ground by deploying speech technologies that help people communicate, access information online, and share their knowledge – all in their language. And combined with the unprecedented translation capabilities of Google Translate, we are also at the forefront of research in speech-to-speech translation and one step closer to a universal translator.

Recent Publications

Speech recognition models, predominantly trained on standard speech, often exhibit lower accuracy for individuals with accents, dialects, or speech impairments. This disparity is particularly pronounced for economically or socially marginalized communities, including those with disabilities or diverse linguistic backgrounds. Project Euphonia, a Google initiative originally launched in English dedicated to improving Automatic Speech Recognition (ASR) of disordered speech, is expanding its data collection and evaluation efforts to include international languages like Spanish, Japanese, French and Hindi, in a continued effort to enhance inclusivity. This paper presents an overview of the extension of processes and methods used for English data collection to more languages and locales, progress on the collected data, and details about our model evaluation process, focusing on meaning preservation based on Generative AI.
Binamix -- A Python Library for Generating Binaural Audio Datasets
Dan Barry
Davoud Shariat Panah
Alessandro Ragano
Andrew Hines
AES 158th Audio Engineering Society Convention (2025)
The increasing demand for spatial audio in applications such as virtual reality, immersive media, and spatial audio research necessitates robust solutions to generate binaural audio data sets for use in testing and validation. Binamix is an open-source Python library designed to facilitate programmatic binaural mixing using the extensive SADIE II Database, which provides Head Related Impulse Response (HRIR) and Binaural Room Impulse Response (BRIR) data for 20 subjects. The Binamix library provides a flexible and repeatable framework for creating large-scale spatial audio datasets, making it an invaluable resource for codec evaluation, audio quality metric development, and machine learning model training. A range of pre-built example scripts, utility functions, and visualization plots further streamline the process of custom pipeline creation. This paper presents an overview of the library’s capabilities, including binaural rendering, impulse response interpolation, and multi-track mixing for various speaker layouts. The tools utilize a modified Delaunay triangulation technique to achieve accurate HRIR/BRIR interpolation where desired angles are not present in the data. By supporting a wide range of parameters such as azimuth, elevation, subject Impulse Responses (IRs), speaker layouts, mixing controls, and more, the library enables researchers to create large binaural datasets for any downstream purpose. Binamix empowers researchers and developers to advance spatial audio applications with reproducible methodologies by offering an open-source solution for binaural rendering and dataset generation. We release the library under the Apache 2.0 License at https://github.com/QxLabIreland/Binamix/
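As a rough illustration of the binaural rendering step mentioned in the abstract above, the sketch below convolves a mono signal with a left/right HRIR pair using NumPy. It does not use the Binamix API: the function name `render_binaural` and the toy impulse responses are hypothetical, and a real pipeline would load measured HRIRs (for example from SADIE II) and interpolate between measured angles.

```python
import numpy as np


def render_binaural(mono, hrir_left, hrir_right):
    """Convolve a mono signal with an HRIR pair (hypothetical helper).

    Both HRIRs are assumed to have the same length, so the two ear
    signals come out the same length and can be stacked as columns.
    Returns an array of shape (num_samples, 2): left ear, right ear.
    """
    left = np.convolve(mono, hrir_left)
    right = np.convolve(mono, hrir_right)
    return np.stack([left, right], axis=-1)


# Toy data, not measured responses: the left ear hears the source
# directly, the right ear hears it one sample later at half amplitude.
mono = np.array([1.0, 0.5, 0.25])
hrir_left = np.array([1.0, 0.0])
hrir_right = np.array([0.0, 0.5])

stereo = render_binaural(mono, hrir_left, hrir_right)
```

With these toy responses the right channel is simply a delayed, attenuated copy of the left, which is the simplest possible stand-in for the interaural time and level differences that real HRIRs encode.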
Perceptual Evaluation of a Mix Presentation for Immersive Audio with IAMF
Carlos Tejeda-Ocampo
Toni Hirvonen
Ema Souza-Blanes
Mahmoud Namazi
AES 158th Convention of the Audio Engineering Society (2025)
Immersive audio mix presentations involve transmitting and rendering several audio elements simultaneously. This enables next-generation applications, such as personalized playback. Using immersive loudspeaker and headphone MUSHRA tests, we investigate bitrate vs. quality for a typical mix presentation use case of a foreground stereo element, plus a background Ambisonics scene. For coding, we use Immersive Audio Model and Formats, a recently proposed system for Next-Generation Audio. Excellent quality is achieved at 384 kbit/s even with a reasonable amount of personalization. We also propose a framework for content-aware analysis that can significantly reduce the bitrate when using underlying legacy audio coding instances.
On the Design of the Binaural Rendering Library for Eclipsa Audio Immersive Audio Container
Tomasz Rudzki
Gavin Kearney
AES 158th Convention of the Audio Engineering Society (2025)
Immersive Audio Model and Formats (IAMF), also known as Eclipsa Audio, is an open-source audio container developed to accommodate multichannel and scene-based audio formats. Headphone-based delivery of IAMF audio requires efficient binaural rendering. This paper introduces the Open Binaural Renderer (OBR), which is designed to render IAMF audio. It discusses the core rendering algorithm, the binaural filter design process, and the real-time implementation of the renderer in the form of an open-source C++ rendering library. Designed for multi-platform compatibility, the renderer incorporates a novel approach to binaural audio processing, leveraging a combination of a spherical harmonic (SH) based virtual listening room model and anechoic binaural filters. Through its design, the IAMF binaural renderer provides a robust solution for delivering high-quality immersive audio across diverse platforms and applications.
A Novel CI Coding Strategy Based on a Cochlear Model and Deep Neural Network
Maryam Hosseini
Tim Brochier
Zachary Smith
Brett Swanson
Andrew Vandali
Alan Kan
Fadwa Alnafjan
Kat Fernandez
Conference on Implantable Auditory Prostheses 2025
Objective: Many CI recipients face difficulties in understanding speech in noisy environments and express frustration with the quality of music. This may be partly due to the simple filter banks used in current CI technology, which do not fully replicate the natural processes of the cochlea. This project aims to improve CI perception by more accurately mimicking the responses of the auditory nerve. Method: Audio signals were applied to CARFAC (Cascade of Asymmetric Resonators with Fast-Acting Compression) [1] to produce a representation of the auditory nerve response, known as a normal hearing (NH) “neurogram”. The NH neurogram was down-sampled and applied to a deep neural network (DNN) to produce 22 electrode stimulation currents. These currents were applied to an electrical hearing (EH) model incorporating current spread, neural adaptation, and refractoriness, to produce a CI neurogram. The DNN was trained on sentences from the TIMIT database to minimise the difference between the NH and CI neurograms. Results: The CI neurograms produced by the CARFAC-DNN strategy were more similar to the NH neurograms than the CI neurograms produced by the Nucleus ACE strategy. Similarity was quantified by the structural similarity index and mean squared error. Conclusions: The CARFAC-DNN strategy may provide a more natural auditory nerve response than traditional CI sound coding strategies. A sound-booth study with CI recipients is planned. This work was funded by Google through the Australian Future Hearing Initiative. References: [1] Lyon, R. F. (2017). Human and machine hearing. Cambridge University Press.
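The abstract above reports mean squared error as one of its two similarity measures. A minimal sketch of that metric, assuming each neurogram is stored as a 2-D array of equal shape, could look like the following. The function name `neurogram_mse` and the toy values are hypothetical and are not taken from the paper; the structural similarity index is not shown.

```python
import numpy as np


def neurogram_mse(nh, ci):
    """Mean squared error between two equally shaped neurograms.

    The 2-D array layout is an assumption made for this sketch;
    the paper does not specify how its neurograms are stored.
    """
    nh = np.asarray(nh, dtype=float)
    ci = np.asarray(ci, dtype=float)
    if nh.shape != ci.shape:
        raise ValueError("neurograms must have the same shape")
    return float(np.mean((nh - ci) ** 2))


# Toy 2x2 "neurograms" that differ in a single cell by 2.0.
nh = np.array([[0.0, 1.0], [2.0, 3.0]])
ci = np.array([[0.0, 1.0], [2.0, 1.0]])

error = neurogram_mse(nh, ci)  # (0 + 0 + 0 + 4) / 4 = 1.0
```

A lower value means the CI neurogram is closer to the normal-hearing one, which is the direction in which the abstract says the DNN was trained.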
Inter-sentence pauses are the silences that occur between sentences in a paragraph or a dialogue. They are an important aspect of long-form speech prosody, as they can affect the naturalness, intelligibility, and effectiveness of communication. However, the user perception of inter-sentence pauses in long-form speech synthesis is not well understood. Previous work often evaluates pause modelling in conjunction with other prosodic features making it hard to explicitly study how raters perceive differences in inter-sentence pause lengths. In this paper, using multiple text-to-speech (TTS) datasets that cover different content types, domains, and settings, we investigate how sensitive raters are to changes to the durations of inter-sentence pauses in long-form speech by comparing ground truth audio samples with renditions that have manipulated pause durations. This experimental design is meant to allow us to draw conclusions regarding the utility that can be expected from similar evaluations when applied to synthesized long-form speech. We find that, using standard evaluation methodologies, raters are not sensitive to variations in pause lengths unless these deviate exceedingly from the norms or expectations of the speech context.