SAN FRANCISCO — The hardest problem in simultaneous interpretation has never been vocabulary. It is timing. A human interpreter must decide, hundreds of times per minute, whether to wait for the speaker to finish a clause before translating it, or to begin speaking with partial information and risk getting the ending wrong. Google, two decades into the machine-translation business, has now published its answer to that same question — and the answer is: lean toward speed and correct as you go.
The company on Monday released Gemini 3.5 Live Translate, an audio model designed for live speech-to-speech translation across more than 70 languages. Unlike the turn-by-turn systems that have defined the category for years, this model does not wait for a speaker to finish a sentence before producing translated audio. It generates speech continuously, staying just a few seconds behind the speaker, balancing — in the engineers’ own framing — the trade-off between waiting for context to improve quality and translating immediately to stay in sync. That gap, a handful of seconds, is the product’s defining characteristic. It is also the thing Google cannot yet eliminate.
The model is available starting Tuesday in public preview through the Gemini Live API and Google AI Studio. It is also rolling out to the Google Translate app on Android and iOS, and enters private preview this month for enterprise customers in Google Meet. A broader Meet rollout is expected later in the year. The simultaneous rollout across three distinct surfaces (developer API, consumer app, enterprise conferencing) signals that Google views this not as a feature but as a platform capability it intends to plant across its product stack before competitors find their footing.
What makes the framing of this release unusual is that Google chose not to characterize the latency as a problem solved. The official blog post, authored by product manager Anuda Weerasinghe and senior staff software engineer Tony Lu, describes the model as one that stays just a few seconds behind the speaker throughout the session — an honest accounting of the technology’s current ceiling, and a notable departure from the overclaiming that typically accompanies AI product launches. Whether that candor reflects engineering discipline or calibrated expectation management is not yet answerable.
The most concrete test of the model’s real-world utility comes not from Google’s own infrastructure but from Grab, the Southeast Asian super-app that is piloting the technology to handle multilingual communication between drivers and travelers at pickups. Grab’s users make more than 10 million voice calls per month through the platform. Those calls are short, high-stakes, frequently noisy, and conducted between speakers of languages that span some of the most linguistically complex terrain on earth — Thai, Vietnamese, Bahasa Indonesia, Tagalog. If the model performs under those conditions, the claim of industrial-scale readiness will carry more weight than any controlled demonstration. Grab has not yet published outcome data.
The model automatically detects the language being spoken without requiring users to configure input settings, a practical improvement over earlier tools that required selecting source and target languages in advance. It preserves the speaker’s intonation, pacing, and pitch in the translated output, a capability that matters considerably in languages where prosody carries meaning, and where a flat, machine-cadence translation can mislead even when the words are accurate. Noise robustness is built in, Google said, to cope with the unpredictable acoustic environments that consumer and enterprise use alike routinely produce.
For Google Meet, the update removes one of the most significant structural limitations of the existing speech translation feature: its confinement to five languages and its insistence on routing all translation through English as an intermediary. The new model supports translation across more than 2,000 language combinations in a single meeting — a figure that expands the practical use case from multinational corporations holding English-anchored calls to genuinely multilingual conversations in which no participant necessarily speaks or understands English. That distinction, for education, diplomacy, healthcare, and cross-border legal proceedings, is not incremental. It is architectural.
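Google has not said how it counts "combinations," but the figure is consistent with simple pair arithmetic on the language total. The sketch below is an illustration only: it assumes exactly 70 languages (the announcement says "more than 70"), and the function name language_pairs is hypothetical, not part of any Google API.

```python
from math import comb, perm


def language_pairs(n_languages: int) -> tuple[int, int]:
    """Return (unordered pairs, directed source->target pairs).

    Assumption: every language can be paired with every other one.
    """
    unordered = comb(n_languages, 2)  # {Thai, Vietnamese} counted once
    directed = perm(n_languages, 2)   # Thai->Vietnamese and Vietnamese->Thai
    return unordered, directed


if __name__ == "__main__":
    unordered, directed = language_pairs(70)
    print(unordered, directed)  # 2415 4830
```

Counted as unordered pairs, 70 languages yield 2,415 combinations, which fits "more than 2,000." Counted by translation direction, the total doubles to 4,830. By the same arithmetic, the five languages of the earlier Meet feature would allow only 10 pairs even without the English intermediary.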
For Android users specifically, Google is introducing what it calls a listening mode: translated audio streams directly through the phone’s earpiece, allowing users to hold the device to their ear as they would during a normal call. The feature is aimed at situations where headphones are unavailable, and it removes the friction of having translated speech play through the speaker where both parties would hear it. It is a small interaction design choice, but one that reflects genuine attention to the contexts in which people actually need translation — at a pharmacy counter, in a hospital waiting room, at a border crossing — rather than the idealized conditions in which the technology demos cleanly.
On the safety architecture, Google has embedded SynthID watermarking into all audio generated by the model. The watermark is imperceptible to listeners and is woven directly into the output audio, ensuring that translated speech can be identified as AI-generated. That matters most in contexts where the translation will be recorded or rebroadcast — a court proceeding, a regulatory hearing, a televised interview — where the provenance of speech carries legal or journalistic weight. The model card published by Google DeepMind details the safety and responsibility approach.
The release arrives as Google’s translation infrastructure has grown to process more than a trillion words per month across its products. The company traces its machine translation work to one of its earliest machine learning experiments, an effort now twenty years into its commercial life. That institutional history cuts in two directions. It means Google arrives at this moment with unmatched scale in deployment and data. It also means the company has had two decades to observe where automated translation fails its users: in the nuance of formal versus informal register, in the carry of emotional tone, in the irreversible consequences of a mistranslated legal term or medical instruction. Whether a model that deliberately accepts a latency trade-off in the name of conversational fluency has resolved those deeper failure modes remains the question the data from Grab and from enterprise Meet customers will, eventually, have to answer.
Developer platforms including Agora, Fishjam, LiveKit, Pipecat, and Vision Agents have already integrated the Gemini Live API. They handle the real-time media streaming infrastructure, allowing developers to build on the translation layer without managing the underlying transport. Earlier this year, Google also announced a related milestone for its translation products. The company has been advancing its broader ecosystem of AI tools as well, including its AI glasses and Android XR, unveiled at Google I/O 2026 alongside a suite of Gemini-powered capabilities.
The enterprise private preview in Google Meet will reach a limited number of Google Workspace business customers this month. Google has not specified when general availability is expected, saying only that a broader rollout will follow later in 2026. The consumer rollout on Google Translate for Android and iOS has begun. The public developer preview is live in Google AI Studio and through the Gemini Live API.

