Machine Learning Research - The Batch | DeepLearning.AI (Page 10)

Architecture of Qwen2.5-Omni showing multimodal processing with vision and audio encoders, thinker, talker, and decoder.

Machine Learning Research

Better Multimodal Performance With Open Weights: Qwen2.5-Omni 7B raises the bar for small multimodal models

Alibaba’s latest open-weights system raises the bar for multimodal tasks in a relatively small model.

Llama 4 Behemoth benchmark chart comparing coding, reasoning, and multilingual scores with Claude, Gemini, and GPT-4.5.

Machine Learning Research

Llama’s Mixture of Vision-Language Experts: Meta releases Llama 4 models, claims edge over AI competitors

Meta updated its popular open-weights models, claiming performance superior to closed competitors in three size classes.

Diagram comparing original transformer model with a replacement model using token-level attention and neuron-level outputs.

Machine Learning Research

Ordinary LLMs Implicitly Take Reasoning Steps: Anthropic experiment finds Claude shows signs of unprompted reasoning

Even without explicit training in reasoning, large language models “think” in ways that may be more deliberate than previously understood.

3D scene comparison of human-object interaction for ZeroHSI, LINGO, and CHOIS models in a synthetic indoor environment.

Machine Learning Research

Human Action in 3D: Stanford researchers use generated video to animate 3D interactions without motion capture

AI systems designed to generate animated 3D scenes that include active human characters have been limited by a shortage of training data, such as matched 3D scenes and human motion-capture examples. Generated video clips can get the job done without motion capture.

Mochi-style illustrated characters with diverse facial expressions used for AI emotion recognition visualizations.

Machine Learning Research

Interactive Voice-to-Voice With Vision: MoshiVis adds image understanding to voice-first conversations

Researchers updated the highly responsive Moshi voice-to-voice model to discuss visual input.

Visual model aligning diffusion embeddings with DINOv2 encoders using REPA and DiT/SiT blocks.

Machine Learning Research

Faster Learning for Diffusion Models: Pretrained embeddings accelerate diffusion transformers’ learning

Diffusion transformers learn faster when they can look at embeddings generated by a pretrained model like DINOv2.

Diagram comparing diffusion, flow matching, and shortcut models for image generation with fewer steps.

Machine Learning Research

Better Images in Fewer Steps: Researchers introduce shortcut models to speed up diffusion

Diffusion models usually take many noise-removal steps to produce an image, which takes time at inference. There are ways to reduce the number of steps, but the resulting systems are less effective. Researchers devised a streamlined approach that doesn’t sacrifice output quality.

Comparison table of Gemini and Gemma models across benchmarks like MMLU, MATH, and GPQA with radar charts.

Machine Learning Research

Vision-Language, Compact and Open: Google releases Gemma 3 vision-language models with open weights

Google updated its open-weights family of large language models to include versions that handle image and video inputs.

Scientific diagram of a denoising model generating stable materials from random elements based on chemistry and symmetry.

Science

Designer Materials: MatterGen, a diffusion model that designs new materials with specified properties

Materials that have specific properties are essential to progress in critical technologies like solar cells and batteries. A machine learning model designs new materials to order.

AI co-scientist workflow diagram showing a research goal assigned to specialized AI agents for hypothesis testing and ranking.

Machine Learning Research

Science Research Proposals Made to Order: AI Co-Scientist, an agent that generates research hypotheses, aiding drug discovery

An AI agent synthesizes novel scientific research hypotheses. It’s already making an impact in biomedicine.

Aya Vision architecture diagram showing vision encoder, multimodal merging, and LLM backbone for image processing.

Machine Learning Research

Equally Fluent in Many Languages: Cohere’s Aya Vision beats multilingual rivals in text & image understanding

Multilingual AI models often suffer uneven performance across languages, especially in multimodal tasks. A pair of lean models counters this trend with consistent understanding of text and images across major languages.

AI model performance benchmark comparing R1 1776 and DeepSeek-R1 across MMLU, DROP, MATH-500, and AIME 2024 tests.

Tech & Society

DeepSeek-R1 Uncensored: Perplexity launches uncensored version of DeepSeek-R1

Large language models built by developers in China may, in some applications, be less useful outside that country because they avoid topics its government deems politically sensitive. A developer fine-tuned DeepSeek-R1 to widen its scope without degrading its overall performance.