AI-Powered Phones Get Proactive, Robot Antelope Joins Herd, LLM Environmental Impacts Get Measured


Dear friends,

Parallel agents are emerging as an important new direction for scaling up AI. AI capabilities have scaled with more training data, training-time compute, and test-time compute. Having multiple agents run in parallel is growing as a technique to further scale and improve performance.

We know from work by my former team at Baidu, and later from OpenAI, that AI models’ performance scales predictably with the amount of data and training computation. Performance rises further with test-time compute, such as in agentic workflows and in reasoning models that think, reflect, and iterate on an answer. But these methods take longer to produce output. Agents working in parallel offer another path to improve results without making users wait.

Reasoning models generate tokens sequentially and can take a long time to run. Similarly, most agentic workflows are initially implemented in a sequential way. But as LLM prices per token continue to fall — thus making these techniques practical — and product teams want to deliver results to users faster, more and more agentic workflows are being parallelized.

Some examples:

  • Many research agents now fetch multiple web pages and examine their texts in parallel to try to synthesize deeply thoughtful research reports more quickly.
  • Some agentic coding frameworks allow users to orchestrate many agents working simultaneously on different parts of a code base. Our short course on Claude Code shows how to do this using git worktrees.
  • A rapidly growing design pattern for agentic workflows is to have a compute-heavy agent work for minutes or longer to accomplish a task, while another agent monitors the first and gives brief updates to the user to keep them informed. From here, it’s a short hop to parallel agents that work in the background while the UI agent keeps users informed and perhaps also routes asynchronous user feedback to the other agents.
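
Here is a minimal sketch of the third pattern above, using Python's asyncio: a compute-heavy agent posts progress updates to a queue while a lightweight monitor relays them to the user. The agent functions and task names are hypothetical placeholders rather than any particular framework's API.

```python
import asyncio

async def heavy_agent(task: str, progress: asyncio.Queue) -> str:
    """Hypothetical compute-heavy agent: works in steps and reports progress."""
    for step in range(1, 6):
        await asyncio.sleep(1)        # stand-in for a slow tool call or LLM call
        await progress.put(f"{task}: finished step {step}/5")
    await progress.put(None)          # sentinel: tell the monitor we're done
    return f"result of {task}"

async def monitor(progress: asyncio.Queue) -> None:
    """UI-facing agent: relays brief updates so the user isn't left waiting."""
    while (update := await progress.get()) is not None:
        print(update)                 # in a real app, stream this to the UI

async def main() -> None:
    progress: asyncio.Queue = asyncio.Queue()
    result, _ = await asyncio.gather(
        heavy_agent("refactor module", progress),  # works in the background
        monitor(progress),                          # keeps the user informed
    )
    print(result)

asyncio.run(main())
```

Routing asynchronous user feedback back to the background agent would amount to adding a second queue flowing in the other direction.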

It is difficult for a human manager to take a complex task (like building a large software application) and break it down into smaller tasks for human engineers to work on in parallel; scaling to huge numbers of engineers is especially challenging. Similarly, it is challenging to decompose tasks for parallel agents to carry out. But the falling cost of LLM inference makes it worthwhile to use many more tokens, and using them in parallel allows this to be done without significantly increasing the user’s waiting time.

I am also encouraged by the growing body of research on parallel agents. For example, I enjoyed reading “CodeMonkeys: Scaling Test-Time Compute for Software Engineering” by Ryan Ehrlich and others, which shows how parallel code generation helps you to explore the solution space. The mixture-of-agents architecture by Junlin Wang is a surprisingly simple way to organize parallel agents: Have multiple LLMs come up with different answers, then have an aggregator LLM combine them into the final output.
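
As an illustration, here is a minimal sketch of the mixture-of-agents pattern, assuming a hypothetical query_llm helper that stands in for whatever chat-completion API you use; it is a sketch of the design, not the paper's implementation.

```python
from concurrent.futures import ThreadPoolExecutor

def query_llm(model: str, prompt: str) -> str:
    """Hypothetical helper; replace with a call to your chat-completion API."""
    return f"[{model}] answer to: {prompt}"

def mixture_of_agents(prompt: str, proposers: list[str], aggregator: str) -> str:
    # Step 1: several "proposer" LLMs answer the same prompt in parallel.
    with ThreadPoolExecutor() as pool:
        proposals = list(pool.map(lambda m: query_llm(m, prompt), proposers))

    # Step 2: an aggregator LLM reads all proposals and writes the final answer.
    aggregation_prompt = (
        "Combine these candidate answers into a single best answer.\n\n"
        + "\n\n".join(f"Candidate {i + 1}:\n{p}" for i, p in enumerate(proposals))
        + f"\n\nQuestion: {prompt}"
    )
    return query_llm(aggregator, aggregation_prompt)

print(mixture_of_agents("Summarize why parallel agents matter.",
                        ["model-a", "model-b", "model-c"], "aggregator-model"))
```

Because the proposers run concurrently, total latency is roughly one proposer call plus one aggregator call, rather than the sum of all the calls.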

There remains a lot of research and engineering to be done on how best to leverage parallel agents, and I believe the number of agents that can work productively in parallel — like the number of humans who can work productively in parallel — will be very high.

Keep building!

Andrew


A MESSAGE FROM DEEPLEARNING.AI

Automate knowledge graph construction using AI agents. Implement a multi-agent system that captures goals, selects relevant data, generates schemas for structured and unstructured sources, and constructs and connects the resulting graphs into a queryable knowledge graph. Enroll for free

News

Proactive AI Assistance for Phones

Google’s latest smartphone sports an AI assistant that anticipates the user’s needs and presents helpful information without prompting.

What’s new: Google unveiled its Pixel 10 along with an AI-powered system called Magic Cue. During calling, texting, and other interactions, the system automatically delivers relevant information — dates, times, names, locations, weather, photos, airline booking numbers, and so on — culled from compatible applications.

How it works: Magic Cue takes advantage of an updated version of Gemini Nano and runs on the Pixel 10’s newly upgraded Tensor G5 AI processor. The system tracks user behavior and provides relevant information proactively.

  • Magic Cue does not require wake words or prompts. It runs in the background and responds to the phone’s state from moment to moment.
  • The system’s output appears within the current app as a floating overlay window.
  • In an example provided by Google, if a user receives a text asking when their flight is scheduled to land, Magic Cue will access the user’s itinerary, extract relevant details, and offer the opportunity to insert them into a reply. If the user calls the airline to change the flight, the system will respond by displaying the flight information.

Behind the news: Google has been especially aggressive in building AI into phones. In 2021, it replaced the Qualcomm Snapdragon chip that had run AI inference on Pixel phones with its own Tensor chip, which combined a GPU, CPU, Tensor processing unit, and security subsystem. Three years later, the Pixel 8’s Tensor G3 chip provided the muscle for AI-enabled audio and video editing — but those capabilities were features within applications. Equipped with the new Tensor G5, Pixel 10 integrates AI with the operating system and applications to provide new kinds of capabilities.

Why it matters: Enabling edge devices to run powerful AI models has been a longstanding goal of big tech companies. But a smartphone’s relatively meager computational, storage, and battery resources have presented serious challenges. The combination of Gemini Nano and the Tensor G5 chip gives Google a strong foundation to keep pushing the limits of edge AI, and its control of the Android operating system gives it tremendous market power to promote its models.

We’re thinking: Apple has noticed Google’s progress. It’s reportedly negotiating with Google to use Gemini technology for its Siri AI assistant.


Mistral Measures LLM Consumption of Energy, Water, and Materials

The French AI company Mistral measured the environmental impacts of its flagship large language model.

What’s new: Mistral published an environmental analysis of Mistral Large 2 (123 billion parameters) that details the model’s emission of greenhouse gases, consumption of water, and depletion of resources, taking into account all computing and manufacturing involved. The company aims to establish a standard for evaluating the environmental impacts of AI models. The study concluded that, while individual uses of the model have little impact compared to, say, using the internet, aggregate use takes a significant toll on the environment.

How it works: Mistral tracked the model’s operations over 18 months. It tallied the impacts of building data centers, manufacturing and transporting servers, training and running the model, and users’ devices, as well as indirect impacts of using the model. The analysis followed the Frugal AI methodology developed by Association Française de Normalisation, a French standards organization. Environmental consultancies contributed to the analysis, and environmental auditors peer-reviewed it.

  • Training Mistral Large 2 produced 20,400 metric tons of greenhouse gases, roughly equal to the annual emissions of 4,400 gas-powered passenger vehicles.
  • Training consumed 281,000 cubic meters of water for cooling via evaporation, roughly as much as the average U.S. family of four would consume in 500 years. (1.5 cubic meters per day x 365 days x 500 years.)
  • Training and inference accounted for 85.5 percent of the model’s greenhouse-gas emissions, 91 percent of its water consumption, and 29 percent of materials consumption including energy infrastructure.
  • Manufacturing, transporting, and decommissioning servers accounted for 11 percent of greenhouse gas emissions, 5 percent of water consumption, and 61 percent of overall materials consumed.  
  • Network traffic came to less than 1 percent of each of the three measures.
  • The average prompt and response (400 tokens, or about a page of text) emitted 1.14 grams of greenhouse gases, about the amount produced by watching a YouTube clip for 10 seconds in the U.S. or 55 seconds in France, where low-emissions nuclear energy is more widely available. It also consumed 45 milliliters (about 3 tablespoons) of water, and its materials consumption was roughly equivalent to that of manufacturing a 2-euro coin.
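
The back-of-envelope script below reproduces the comparisons above from the reported figures. The per-car emissions factor and the aggregate usage volume are assumptions added for illustration, not numbers from Mistral's report.

```python
# Back-of-envelope checks of the figures above (rounded inputs, illustrative only).

training_ghg_tons = 20_400        # metric tons of CO2-equivalent to train Mistral Large 2
car_tons_per_year = 4.6           # assumed annual emissions of one gas-powered passenger car
print(f"car-years of emissions: {training_ghg_tons / car_tons_per_year:,.0f}")  # ~4,400

training_water_m3 = 281_000       # cubic meters of water evaporated for cooling
household_m3_per_day = 1.5        # assumed daily use of a U.S. family of four
print(f"household-years of water: {training_water_m3 / (household_m3_per_day * 365):,.0f}")  # ~500

grams_per_exchange = 1.14         # grams of greenhouse gases per 400-token prompt and response
daily_exchanges = 1_000_000_000   # hypothetical aggregate usage across all users
print(f"metric tons per day at that volume: {grams_per_exchange * daily_exchanges / 1e6:,.0f}")  # ~1,140
```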

Yes, but: Mistral acknowledged a few shortcomings of the study. It struggled to calculate some impacts due to the lack of data and established standards. For instance, a reliable assessment of the environmental impact of GPUs is not available.

Behind the news: Mistral’s report follows a string of studies that assess AI’s environmental impact.

  • While AI is likely to consume increasing amounts of energy, it could also produce huge energy savings in coming years, according to a report by the International Energy Agency, which advises 44 countries on energy policy. 
  • A 2023 analysis by researchers at the University of California and the University of Texas quantified the water consumption of GPT-3 (175 billion parameters). Its conclusions align with those of Mistral’s analysis.
  • A 2021 paper identified ways to make AI models up to a thousand-fold more energy-efficient by streamlining architectures, upgrading hardware, and boosting the energy efficiency of data centers. 

Why it matters: AI consumes enormous amounts of energy and water, and finding efficient ways to train and run models is critical to ensure that the technology can benefit large numbers of people. Mistral’s methodology provides a standardized approach to assessing environmental impacts. If it’s widely adopted, it could help researchers, businesses, and users compare models, work toward more environmentally friendly AI, and potentially reduce overall impacts.

We’re thinking: Data centers and cloud computing are responsible for 1 percent of the world’s energy-related greenhouse gas emissions, according to the International Energy Agency. That’s a drop in the bucket compared to agriculture, construction, or transportation. Nonetheless, having a clear picture of AI’s consumption of resources can help us manage them more effectively as demand rises. It’s heartening that major AI companies are committed to developing sustainable energy sources and using them efficiently, and that the environmental footprint of new AI models and processors is falling steadily.


Robot Antelope Joins Herd

Researchers in China disguised a quadruped robot as a Tibetan antelope to help them study the animals up close.

What’s new: The Chinese Academy of Sciences teamed with Hangzhou-based Deep Robotics and the state news service Xinhua to introduce a robot into a herd that lives in the Hoh Xil National Nature Reserve, a mountainous area where the elevation is above 14,000 feet. The robot enables scientists to observe the shy antelopes without disturbing them.

How it works: The mechanical beast is a Deep Robotics X30 covered with an antelope’s hide. The X30, which is designed for industrial inspections and search-and-rescue tasks, is well suited to the region’s rugged terrain and conditions. It can climb open-riser staircases, function at temperatures between -20° and 55° Celsius, and resist dust and water according to ratings established by the International Electrotechnical Commission. Its vision system is designed to operate in dim or very bright light.

  • Deep Robotics has published little information about the X30’s training, though it has said the robot learned to navigate rough terrain via the reinforcement learning algorithm proximal policy optimization (PPO). However, the company’s GitHub repository reveals details about its consumer-market robot, Lite3. (The two are similar, but their training may not be.) Lite3 used multiple vanilla neural networks: one to embed current and previous joint positions and velocities, and another to calculate joint motions. Lite3 learned via PPO to move a simulated robot across various terrains (flat, sloped, staircased, random, and so on) in the Isaac Gym simulator. It received rewards when the simulated robot moved forward or took larger steps, and punishments when the robot moved too fast, failed to move, fell over, collided with objects, and so on (a sketch of this kind of reward shaping appears after this list).
  • The X30 is equipped with cameras (two hidden beneath its fake eyes plus a wide-angle camera), LiDAR, ultrasonic sensors, and a GPS system with a real-time kinematics module for more precise location tracking. Its computer-vision software automatically tracks the herd’s movement, feeding, and reproduction and transmits data via 5G radio. If it detects the herd nearing a road, it sends an alert so its operators can direct automobile traffic, allowing the animals to cross safely.
  • The robot can be controlled remotely from up to 1.2 miles away. Its top speed is 8 miles per hour, while Tibetan antelopes can run as fast as 50 miles per hour. Its battery lasts up to 4 hours and features a quick-release mechanism for streamlined swapping.
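
Below is a rough, self-contained sketch of the kind of reward shaping described in the first bullet above: reward forward progress and larger steps, punish excessive speed, stalling, falls, and collisions. The state fields and weights are illustrative assumptions, not Deep Robotics' actual training code.

```python
def locomotion_reward(state: dict) -> float:
    """Illustrative reward shaping for simulated quadruped locomotion.

    The state fields and weights are assumptions chosen for clarity; a real
    training setup defines its own observations and reward terms.
    """
    reward = 0.0
    reward += 2.0 * max(state["forward_velocity"], 0.0)    # reward forward progress
    reward += 0.5 * state["step_length"]                    # reward larger steps
    if state["forward_velocity"] > state["max_velocity"]:   # punish moving too fast
        reward -= 1.0
    if abs(state["forward_velocity"]) < 0.05:               # punish failing to move
        reward -= 0.5
    if state["fell_over"]:                                   # punish falling over
        reward -= 10.0
    reward -= 1.0 * state["num_collisions"]                  # punish collisions
    return reward

# Example step from a hypothetical simulation:
print(locomotion_reward({"forward_velocity": 0.8, "max_velocity": 1.5,
                         "step_length": 0.3, "fell_over": False, "num_collisions": 0}))
```

PPO would then adjust the policy network to maximize the expected sum of such rewards over each simulated episode.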

Behind the news: Human observation can disrupt animal behavior, so the study of animals in their natural habitat relies mostly on camera traps and drones. Increasingly, biologists are experimenting with robots mocked up to look like animals.

  • In Florida, robot bunnies automatically lure invasive Burmese pythons and alert researchers when their sensors detect the reptiles. 
  • Robot falcons that fly thanks to wing-mounted propellers scare birds from airport runways to reduce the risk that they’ll interfere with aircraft.

Why it matters: Applying AI to robotic perception, locomotion, and dexterity opens a wide range of applications. Case in point: Deep Robotics’ PPO training enables its robots to navigate difficult environments (like climbing uneven staircases) and respond to dynamic challenges (like being kicked down the stairs). Such capabilities are valuable not only in domestic and industrial uses but also in research settings like observing antelope behavior.

We’re thinking: Robotics is making impressive strides!


Better Image Processing Through Self-Supervised Learning

DINOv2 showed that a vision transformer pretrained on unlabeled images could produce embeddings that are useful for a wide variety of tasks. Now it has been updated to improve the performance of its embeddings in segmentation and other vision tasks.

What’s new: Oriane Siméoni and colleagues at Meta, World Resources Institute, and France’s National Institute for Research in Digital Science and Technology released the weights and training code for DINOv3, a self-supervised model that updates the previous version with 6 times more parameters, more training data, and a new loss function.

  • Input/output: Image in, embedding out
  • Architecture: 6.7 billion-parameter vision transformer
  • Performance: Outstanding image segmentation and depth estimation
  • Training data: Over 1.7 billion images from public Instagram posts
  • Availability: Weights and training code are available via a license that allows non-commercial and commercial uses but forbids military applications
  • Undisclosed: Input size limit

Key insight: Vision transformers trained in a self-supervised fashion —  such as feeding them unlabeled images with missing patches and training them to fill in the blanks — yield uneven results beyond a certain number of training steps. Further training increases performance on tasks that depend on analyzing an image globally, like classification and face recognition, but degrades it in tasks that concentrate on portions of an image, like image segmentation and depth estimation. The DINOv3 team discovered the reason: The model’s embeddings of random patches become more similar as training continues. To counteract this, they used the model trained up to that point as a teacher and trained successive versions to avoid producing patch embeddings that were more similar to one another than the teacher’s embeddings were.
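
As a sketch of this idea (not DINOv3's actual loss code), the function below penalizes the student model when its patch-to-patch similarity matrix drifts from that of the earlier-checkpoint teacher, assuming each model's patch embeddings for an image arrive as a (num_patches, dim) tensor.

```python
import torch

def patch_similarity_loss(student_patches: torch.Tensor,
                          teacher_patches: torch.Tensor) -> torch.Tensor:
    """Penalize differences between the two models' patch-similarity matrices.

    Both inputs have shape (num_patches, dim). Illustrative sketch only.
    """
    s = torch.nn.functional.normalize(student_patches, dim=-1)
    t = torch.nn.functional.normalize(teacher_patches, dim=-1)
    sim_student = s @ s.T   # how similar the student's patch embeddings are to one another
    sim_teacher = t @ t.T   # the same, for the earlier-checkpoint teacher
    return ((sim_student - sim_teacher) ** 2).mean()

# Example with random embeddings for 16 patches of dimension 32:
loss = patch_similarity_loss(torch.randn(16, 32), torch.randn(16, 32))
print(loss.item())
```

During further training, such a term would be added to the existing self-supervised losses so the patches stay about as distinct as they were at the teacher's checkpoint.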

How it works: The building of DINOv3 followed that of its predecessor DINOv2 but added a new loss term.

  • The team trained DINOv3 to embed images of 256x256 pixels for the first 1 million steps. During this phase, they periodically measured how well DINOv3’s embeddings supported segmentation. For each test, they froze the model and trained a linear layer to predict a segmentation map from the frozen model’s embedding of an image, using the PASCAL VOC dataset of images and segmentation maps. The segmentation score (measured using mean intersection over union, the overlap between the model’s output and ground truth; see the sketch after this list) peaked after around 100,000 training steps and decreased steadily after around 200,000 training steps.
  • To enable the model to relearn how to produce distinct patch embeddings — a skill increasingly lost during the first phase of training — they continued to train DINOv3 for another 10,000 to 30,000 steps using an additional loss term. The new term penalized differences between the pairwise similarities among patch embeddings produced by the current model and those produced by the model at 100,000 training steps. Comparing similarity patterns rather than the embeddings themselves let the model produce embeddings that differ from its less-trained counterpart’s while keeping its patch embeddings about as distinct from one another as they were when performance on tasks like segmentation peaked.
  • They trained the model in the same way for another 10,000 steps on image sizes up to 768x768 pixels.
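
For reference, here is a small sketch of the mean intersection-over-union metric used in the evaluation above and in the results below, computed with NumPy over integer class maps; the toy arrays and class count are illustrative.

```python
import numpy as np

def mean_iou(pred: np.ndarray, target: np.ndarray, num_classes: int) -> float:
    """Mean IoU over classes that appear in the prediction or ground truth."""
    ious = []
    for c in range(num_classes):
        pred_c, target_c = pred == c, target == c
        union = np.logical_or(pred_c, target_c).sum()
        if union == 0:
            continue  # class absent from both maps; skip it
        intersection = np.logical_and(pred_c, target_c).sum()
        ious.append(intersection / union)
    return float(np.mean(ious))

# Toy 4x4 segmentation maps with 3 classes:
pred = np.array([[0, 0, 1, 1], [0, 0, 1, 1], [2, 2, 2, 2], [2, 2, 2, 2]])
target = np.array([[0, 0, 1, 1], [0, 1, 1, 1], [2, 2, 2, 2], [2, 2, 0, 0]])
print(mean_iou(pred, target, num_classes=3))
```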

Results: The authors adapted the trained embedding model for various uses by adding separate linear layers and training them on tasks including segmentation and classification.

  • Segmenting images in PASCAL VOC, DINOv3 achieved 86.6 mean IoU (intersection over union, higher is better). DINOv2 achieved 83.1 mean IoU, and SigLIP 2, a model trained via weak supervision to produce similar embeddings of text and images, achieved 72.7 mean IoU.
  • Classifying images in ImageNet, DINOv3 (88.4 percent accuracy) outperformed the next-best self-supervised model DINOv2 (87.3 percent accuracy). It underperformed two weakly supervised models, SigLIP 2 (89.1 percent accuracy) and PECore (89.3 percent accuracy).

Why it matters: Self-supervised learning is important in visual AI because image and video data are more common than image-text and video-text data. The additional loss term enabled the team to use this more plentiful data to improve performance on both globally and locally focused tasks.

We’re thinking: Model builders have raced to make ever bigger large language models trained on more data, and their performance has improved with each leap in size. That hasn’t happened with vision transformers, but DINOv3, which is 6 times larger and trained on an order of magnitude more data than its predecessor, suggests that it could.