
Machine Learning Research
Better Multimodal Performance With Open Weights: Qwen2.5-Omni 7B raises the bar for small multimodal models
Alibaba’s latest open-weights system raises the bar for multimodal tasks in a relatively small model.
Machine Learning Research
Researchers updated the highly responsive Moshi voice-to-voice model to discuss visual input.
Machine Learning Research
Microsoft debuted its first official large language model that responds to spoken input.
Tech & Society
Amazon announced Alexa+, a major upgrade to its long-running voice assistant.
Machine Learning Research
Even cutting-edge, end-to-end, speech-to-speech systems like ChatGPT’s Advanced Voice Mode tend to get interrupted by interjections like “I see” and “uh-huh” that keep human conversations going. Researchers built an open alternative that’s designed to go with the flow of overlapping speech.
Business
Hate talking to customer service? An AI-powered tool may soon do it for you. Joshua Browder, chief executive of the consumer advocacy organization DoNotPay, demonstrated a system that autonomously navigates phone menus and converses...
Tech & Society
Amazon published a series of web pages designed to help people use AI responsibly. Amazon Web Services introduced so-called AI service cards that describe the uses and limitations of some models it serves.
Tech & Society
Most speech-to-speech translation systems use text as an intermediate mode. So how do you build an automated translator for a language that has no standard written form? A new approach trained neural networks to translate a primarily oral language.
Machine Learning Research
In spoken conversation, people naturally take turns amid interjections and other patterns that aren’t strictly verbal. A new approach generated natural-sounding audio dialogs without training on text transcriptions that mark when one party should stop speaking and the other should chime in.
Tech & Society
Even if we manage to stop robots from taking over the world, they may still have the last laugh. Researchers at Kyoto University developed a series of neural networks that enable a robot engaged in spoken conversation to chortle along with its human interlocutor.
Business
A startup that automatically translates video voiceovers into different languages is ready for its big break. London-based Papercup offers a voice translation service that combines algorithmic translation and voice synthesis with human-in-the-loop quality control.
Tech & Society
Let’s get this out of the way: A brain is not a cluster of graphics processing units, and if it were, it would run software far more complex than the typical artificial neural network. Yet neural networks were inspired by the brain’s architecture.