Self-Driving Reasoning Models, ChatGPT Adds Ads, Apple’s Deal with Google, 3D Generation Pronto
Dear friends,
How can businesses go beyond using AI for incremental efficiency gains to create transformative impact? I’m writing this letter from the World Economic Forum (WEF) in Davos, Switzerland, where I’ve been speaking with many CEOs about how to use AI for growth. A recurring theme of these conversations is that running many experimental, bottom-up AI projects — letting a thousand flowers bloom — has failed to lead to significant payoffs. Instead, bigger gains require workflow redesign: taking a broader, perhaps top-down view of the multiple steps in a process and changing how they work together from end to end.
Consider a bank issuing loans. The workflow consists of several discrete stages: Marketing -> Application -> Preliminary Approval -> Final Review -> Execution
Suppose each step used to be manual. Preliminary Approval used to require an hour-long human review, but a new agentic system can do this automatically in 10 minutes. Swapping human review for AI review — but keeping everything else the same — gives a minor efficiency gain but isn’t transformative.
Here’s what would be transformative: Instead of applicants waiting a week for a human to review their application, they can get a decision in 10 minutes. When that happens, the loan becomes a more compelling product, and that better customer experience allows lenders to attract more applications and ultimately issue more loans.
However, making this change requires taking a broader business or product perspective, not just a technology perspective. Further, it changes the workflow of loan processing. Switching to offering a “10-minute loan” product would require changing how it is marketed. Applications would need to be digitized and routed more efficiently, and final review and execution would need to be redesigned to handle a larger volume.
Even though AI is applied only to one step, Preliminary Approval, we end up implementing not just a point solution but a broader workflow redesign that transforms the product offering.
At AI Aspire (an advisory firm I co-lead), here’s what we see: Bottom-up innovation matters because the people closest to problems often see solutions first. But scaling such ideas to create transformative impact often requires seeing how AI can transform entire workflows end to end, not just individual steps, and this is where top-down strategic direction and innovation can help.
This year's WEF meeting, as in previous years, has been an energizing event. Among technologists, frequent topics of discussion include Agentic AI (when I coined this term, I was not expecting to see it plastered on billboards and buildings!), Sovereign AI (how nations can control their own access to AI), Talent (the challenging job market for recent graduates, and how to upskill nations), and data-center infrastructure (how to address bottlenecks in energy, talent, GPU chips, and memory). I will address some of these topics in future letters.
Against the backdrop of growing geopolitical uncertainty, I hope all of us in AI will keep building bridges that connect nations, sharing through open source, and building to benefit all nations and all people.
Keep building!
Andrew
A MESSAGE FROM DEEPLEARNING.AI
Learn how to build multi-step workflows from the command line using Gemini CLI, an open-source agent that works across local files, developer tools, and cloud services. Automate coding tasks, build software features, create dashboards, and apply agentic workflows beyond code. Enroll now
News
ChatGPT Shows Ads
AI has a new revenue stream, and it looks a lot like old web banner ads.
What’s new: OpenAI began a test to display advertisements in ChatGPT. Ads appear to U.S. users of OpenAI’s free and least-expensive plans (not to subscribers to ChatGPT Plus, Pro, Business, or Enterprise tiers or users of the API). The company plans to expand the experiment to other regions and test more-conversational ads on an unspecified timeline.
How it works: Ads relevant to a conversation appear at the bottom of the chat and include a brief message, an image, and a link. They do not influence chat responses. Ads appear only to adults in the U.S. who are logged in to the desktop or mobile versions of the ChatGPT website or app.
- Look and feel: Ads are clearly labeled and separated from chat responses. Users can dismiss ads and provide feedback.
- Privacy: Ads do not appear near chats that discuss health, mental health, or politics. Conversations are not shared with advertisers.
- Controls: Besides the current conversation, ads are tailored to each user according to their chat history, location, and personal information they have shared with ChatGPT. Users can turn personalization on and off, reset data used for ad targeting, or clear their chat history entirely.
- Future plans: Ads could eventually appear in different layout formats and for users in different regions and tiers. OpenAI showed a mockup of a display ad at the top of a conversation, rather than the bottom, in the mobile app. The company said future ads might allow users to ask questions about ad content to help make a purchasing decision. OpenAI said it would always offer an ad-free plan but left open the possibility of expanding ads to other paid tiers.
Behind the news: OpenAI is figuring out how to bring in enough revenue to yield profit. The company revealed that it took in $20 billion in revenue and used 1.9 gigawatts of computing power in 2025 at a cost estimated to have exceeded $9 billion. (Both revenue and processing have roughly tripled annually since 2023.) Meanwhile, OpenAI projects capital spending of $115 billion by 2029, The Information reported. Advertising is part of an evolving revenue strategy that includes subscriptions, ecommerce, and metered API access.
- Subscriptions: OpenAI said numbers of weekly and monthly active users of ChatGPT continued to reach all-time highs, but it didn’t specify numbers or how they break down between free and paid plans. In October, CEO Sam Altman said ChatGPT had reached 800 million weekly active users, which includes a reported 35 million Plus or Pro subscribers.
- Localized pricing: OpenAI announced a worldwide expansion of ChatGPT Go, the low-cost, limited-capability subscription plan the company tested in India. ChatGPT Go costs $8 per month in the U.S., with lower prices in some countries; for example, ₹399 per month (roughly $4.40) in India.
- Ecommerce: In September, OpenAI introduced agentic shopping that allows logged-in users to buy items from participating merchants including Etsy, Shopify, and Walmart within ChatGPT. It’s unclear whether users will be able to buy advertised products directly during the current tests.
Why it matters: Delivering AI to a fast-growing, worldwide market incurs immense expenses, and business strategies are still evolving. Unlike its Big Tech rivals, OpenAI doesn’t have other businesses to offset these costs (although Google is also experimenting with chatbot ads). The combination of advertising and low-cost ChatGPT subscriptions gives OpenAI a new route to profit. If it works, the company’s premium tiers will no longer completely subsidize the free ones, and premium-tier users will continue to use ChatGPT ad-free, at least for now.
We’re thinking: OpenAI is dipping its toes into the water with display ads, a tried-and-true advertising format. However, genuinely chatbot-native advertising probably will look and feel significantly different.
Training Cars to Reason
Chain-of-thought reasoning can help autonomous vehicles decide what to do next.
What’s new: Nvidia released Alpamayo-R1, a vision-language action model for autonomous vehicles that uses reasoning to reduce potential collisions.
- Input/output: 2 seconds of video from each of four cameras, text commands, position and rotation history in; reasoning text, 6.4 seconds of a vehicle’s future trajectory (position and rotation) out with 99 milliseconds latency running on Nvidia RTX Pro 6000 (Blackwell)
- Architecture: Transformer encoder (8.2 billion parameters), transformer decoder (2.3 billion parameters)
- Performance: In simulation, fewer “close encounters” (distance unspecified) with other vehicles
- Availability: Weights available to download for noncommercial uses
- Undisclosed: Performance comparisons to competing models, datasets, and reward model used in training
How it works: Alpamayo-R1 comprises Cosmos-Reason1 (a vision-language model that’s pretrained to describe actions) and a diffusion transformer that produces vehicle trajectory data. Given video frames and trajectory data that represent the last 2 seconds, plus any verbal commands, Cosmos-Reason1 produces reasoning text. Given Cosmos-Reason1’s embeddings of the video frames, previous trajectory data, and reasoning text, the diffusion transformer produces future trajectory data. The authors trained the system in three phases:
- The authors trained Alpamayo-R1 to generate actions across multiple fields, including healthcare, logistics, retail, and manufacturing as well as autonomous driving.
- They trained Alpamayo-R1 to reason and produce actions using 80,000 hours of videos and vehicle motion labeled with human- or machine-produced reasoning. The reasoning text included up to two decisions at any particular video frame (such as stop, set speed, or merge) as well as any number of rationales for the decision (like a pedestrian in a crosswalk, lanes merging ahead, or road construction).
- They further trained the system via reinforcement learning to improve its reasoning and align its reasoning with its actions. Specifically, they rewarded the system based on (i) how well its reasoning matched ground-truth reasoning according to an unspecified reward model, (ii) how well its reasoning matched its subsequent actions according to simple rules, (iii) how well its output actions matched ground-truth actions, (iv) whether predicted actions avoided collisions, and (v) how smoothly the vehicle executed its actions. A simplified sketch of combining such reward terms appears below.
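The paper doesn’t disclose the reward model or how these terms are weighted, so the following is only a minimal sketch of how a combined, multi-term reward like the one described above might be assembled. All names, weights, and scoring functions are hypothetical placeholders, not Nvidia’s implementation.

```python
# Hypothetical sketch of a combined RL reward for a reasoning-and-driving policy.
# The names, weights, and scores below are illustrative stand-ins, not Alpamayo-R1's code.

from dataclasses import dataclass

@dataclass
class RolloutScores:
    reasoning_vs_ground_truth: float     # (i) reward-model score of reasoning vs. labeled reasoning
    reasoning_action_consistency: float  # (ii) rule-based check that the text matches the maneuver
    action_vs_ground_truth: float        # (iii) similarity of predicted to ground-truth trajectory
    collision_free: float                # (iv) 1.0 if the predicted trajectory avoids collisions, else 0.0
    smoothness: float                    # (v) score for low jerk and gentle acceleration

# Hypothetical weights; the paper does not report how the terms are balanced.
WEIGHTS = {
    "reasoning_vs_ground_truth": 1.0,
    "reasoning_action_consistency": 1.0,
    "action_vs_ground_truth": 2.0,
    "collision_free": 5.0,
    "smoothness": 0.5,
}

def combined_reward(s: RolloutScores) -> float:
    """Weighted sum of the five reward terms for one rollout."""
    return (
        WEIGHTS["reasoning_vs_ground_truth"] * s.reasoning_vs_ground_truth
        + WEIGHTS["reasoning_action_consistency"] * s.reasoning_action_consistency
        + WEIGHTS["action_vs_ground_truth"] * s.action_vs_ground_truth
        + WEIGHTS["collision_free"] * s.collision_free
        + WEIGHTS["smoothness"] * s.smoothness
    )

# Example: a rollout whose reasoning and trajectory both score well.
print(combined_reward(RolloutScores(0.8, 1.0, 0.7, 1.0, 0.9)))
```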
Results: The authors compared their system to a version that was trained on the same data except the reasoning datasets. In 75 simulated scenarios, the reasoning model experienced “close encounters” (distance undisclosed) with other vehicles 11 percent of the time, which is down from the non-reasoning model’s 17 percent.
Why it matters: Chain-of-thought reasoning is useful for robots. Unlike earlier vision-language-action models that use reasoning, Alpamayo-R1 was trained not only to perform better but also to match its actions to its reasoning. This made the model’s reasoning both more effective and more interpretable. In case of a mishap, an engineer can review the system’s reasoning to understand why it made a particular decision and then adapt training or inference to avoid similar outcomes in the future.
We’re thinking: In the past year, reasoning models have outperformed their non-reasoning counterparts in math, science, coding, image understanding, and robotics. Chain-of-thought turns out to be an extremely useful algorithm.
Apple’s Foundation Models Will Be Gemini
Apple cut a multi-year deal with Google to use Gemini models as the basis of AI models that reside on Apple devices.
What’s new: Apple will use Google’s technology to drive revamped versions of its AI assistant Siri and other AI features, the companies said in a joint announcement. The fruits of the partnership are expected to start rolling out in spring 2026. The companies did not disclose financial or technical details.
How it works: Bloomberg reported on planned updates of Siri this week as well as rumors of the partnership in November, September, and August. The companies have not confirmed many details in those reports. The information below draws largely on those reports, with additional points from the sources noted.
- Apple will pay $1 billion annually, according to Bloomberg. The deal is structured as a cloud-computing contract rather than a license, Financial Times reported.
- Apple will have access to a 1.2 trillion-parameter Gemini model, known internally as Apple Foundation Models Version 10, that was specially modified to run on Apple’s servers. A more capable model, Apple Foundation Models Version 11, will follow later this year, and it may run on Google’s servers. Apple will fine-tune Google models and control user interfaces of applications that call them, The Information reported.
- Two updates of Siri are in the works. The first, which will be part of iOS 26.4 and use Apple Foundation Models Version 10, will enable the software to analyze onscreen images and provide output based on user data. The second, which will be included with iOS 27 and take advantage of Apple Foundation Models 11, will be a voice-and-text chatbot that can search the web, generate media, analyze files, and interact with email, music, photo, and other apps.
- The coming versions of Siri will draw on a device’s current screen as well as context from past interactions such as emails, messages, calendar events, and earlier conversations. They will be able to take multi-step agentic actions such as finding a photo, editing it, and emailing it to a contact. In addition, they will perform tasks that are increasingly common for large language models and agentic systems, like telling stories, providing emotional support, and booking travel arrangements.
- Apple will continue to offer access to ChatGPT, which it integrated into its operating system in December. Apple can continue to route Siri queries to OpenAI models if its own models can’t answer, and Siri can ask users if they want ChatGPT to answer. However, OpenAI technology will not serve as the heart of Apple’s AI features. Although OpenAI CEO Sam Altman had hoped he might gain billions of dollars in revenue from Apple and supplant Google as Apple’s longstanding partner in search, OpenAI made a “conscious decision” not to provide models to Apple so it could pursue its own initiative — led by the iPhone’s original lead designer Jony Ive — to build post-smartphone mobile devices, according to Financial Times.
Behind the news: The partnership signals Apple’s retreat from building proprietary AI software and infrastructure after periodic reports that it was trying to do so. As early as July, Apple was evaluating models from Anthropic, Google, and OpenAI as potential replacements for its own technology, Bloomberg reported last year.
- In June, the company said a planned update of Siri was delayed because the system did not meet internal standards of quality. It had announced the update one year earlier.
- In 2023, Apple built a framework for large language models and used it to develop a chatbot dubbed Apple GPT — for internal use only.
- While rivals like Google, Microsoft, and OpenAI dove into generative AI, Siri was outpaced by newer systems, and Apple came to be viewed as falling behind its big-tech rivals in AI.
Why it matters: In teaming up with Google, Apple is withdrawing from an immensely costly competition to build cutting-edge AI software and infrastructure. At the same time, it’s shoring up its own most lucrative product, the iPhone, which accounts for half its revenue. The deal puts iOS devices back on track to deliver competitive AI capabilities in the short term — despite the irony that Apple’s biggest competitor in mobile devices is Google.
We’re thinking: Google pays $20 billion annually to Apple for the privilege of supplying the default search engine on iPhones. Apple’s payment to Google of $1 billion for access to cutting-edge models — with no requirement to share data — is inexpensive in comparison, and likely reflects Apple’s deftness in playing Google, OpenAI, and Anthropic against each other. Apple’s control over the iPhone gives it tremendous market power.
Detailed Text- or Image-to-3D, Pronto
Current methods that generate 3D scenes from text or images are slow and yield inconsistent results. Researchers introduced a technique that produces detailed, coherent 3D scenes in seconds.
What’s new: Researchers at Xiamen University, Tencent, and Fudan University developed FlashWorld, a generative model that takes a text description or image and produces a high-quality 3D scene, represented as Gaussian splats; that is, millions of colored, semi-transparent ellipsoids. You can run the model using code that’s licensed for noncommercial and commercial uses under Apache 2.0 or download the model under a license that allows noncommercial uses.
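As a rough illustration of the output representation (not FlashWorld’s specific format, which isn’t described here), a Gaussian-splat scene is essentially a large array of per-ellipsoid parameters, as in the sketch below. The field names and shapes are generic placeholders typical of splatting pipelines.

```python
# Generic illustration of a Gaussian-splat scene: millions of colored, semi-transparent ellipsoids.
# Field names and shapes are typical of splatting pipelines, not FlashWorld's exact format.

from dataclasses import dataclass
import numpy as np

@dataclass
class GaussianSplatScene:
    means: np.ndarray      # (N, 3) center of each ellipsoid in world coordinates
    scales: np.ndarray     # (N, 3) per-axis extent of each ellipsoid
    rotations: np.ndarray  # (N, 4) orientation as a unit quaternion
    colors: np.ndarray     # (N, 3) RGB color of each ellipsoid
    opacities: np.ndarray  # (N,)  transparency; rendering alpha-blends splats along each ray

# A toy scene with one million randomly initialized splats.
N = 1_000_000
scene = GaussianSplatScene(
    means=np.random.randn(N, 3).astype(np.float32),
    scales=(np.abs(np.random.randn(N, 3)) * 0.01).astype(np.float32),
    rotations=np.tile(np.array([1.0, 0.0, 0.0, 0.0], dtype=np.float32), (N, 1)),
    colors=np.random.rand(N, 3).astype(np.float32),
    opacities=np.random.rand(N).astype(np.float32),
)
print(scene.means.shape, scene.opacities.shape)
```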
Key insight: There are two dominant approaches to generating 3D scenes: 2D-first and 3D-direct. The 2D-first approach generates multiple 2D images of a scene from different angles and constructs a 3D scene from them. This produces highly detailed surfaces but often results in an inconsistent 3D representation. The 3D-direct approach generates a 3D representation directly, which ensures 3D consistency but often lacks detail and photorealism. A model that does both could learn how to represent rich details while enforcing 3D consistency. To accelerate the process, the model could learn to replicate a teacher model’s multi-step refinement in one step.
How it works: FlashWorld comprises a pretrained video-diffusion model (WAN2.2-5B-IT2V) and a copy of its decoder that was modified to generate 3D output. The authors trained the system to generate images and 3D models using a few public datasets that include videos, multi-view images, object masks, camera parameters, and/or 3D point clouds. In addition, they used a proprietary dataset of matching text and multi-view images of 3D scenes including camera poses of the different views.
- The authors added noise to pre-existing images of 3D scenes and pretrained the system to remove the noise over dozens of steps, until it could produce fresh images from pure noise. In addition to removing noise, this system learned to minimize the difference between rendered views of 3D scenes (given a camera pose) and the ground-truth views.
- They fine-tuned the system using three loss terms. They noticed that after pretraining, the diffusion model produced high-quality views, so they used a copy of it as a teacher. The first loss term encouraged their system to generate 3D scenes that, when rendered, produced views similar to those produced by the teacher in a few noise-removal steps.
- The second loss term used another copy of the teacher, with the addition of convolutional neural network layers, as a discriminator that learned to classify the student’s output as natural or generated. The student learned to produce images that fooled the discriminator into classifying them as natural.
- The third loss term encouraged similarity between the images produced by the image-generating decoder and views rendered from the 3D-generating decoder’s output. A simplified sketch of combining the three loss terms appears after this list.
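The exact loss weights, teacher sampling schedule, and discriminator inputs aren’t specified here, so the following is only a minimal sketch of how the three fine-tuning terms might be combined. The function, weights, and tensors are hypothetical stand-ins, not FlashWorld’s code.

```python
# Hypothetical sketch of combining FlashWorld's three fine-tuning loss terms.
# Weights and inputs are illustrative placeholders; e.g., the discriminator is assumed
# to see rendered views, though the paper may apply it to other outputs as well.

import torch
import torch.nn.functional as F

def combined_loss(rendered_views, teacher_views, images_2d, disc_logits_on_rendered,
                  w_distill=1.0, w_adv=0.1, w_consistency=1.0):
    """rendered_views: views rendered from the student's 3D (splat) output
       teacher_views:  views the pretrained teacher produced via a few denoising steps
       images_2d:      images from the student's 2D image-generating decoder
       disc_logits_on_rendered: discriminator logits for the rendered views"""
    # (1) Distillation: rendered views should match the teacher's few-step views.
    distill = F.mse_loss(rendered_views, teacher_views)
    # (2) Adversarial: push the discriminator to label rendered views as natural (logit > 0).
    adv = F.softplus(-disc_logits_on_rendered).mean()
    # (3) 2D/3D consistency: the 2D decoder's images should match views rendered from the 3D output.
    consistency = F.mse_loss(images_2d, rendered_views)
    return w_distill * distill + w_adv * adv + w_consistency * consistency

# Toy example with random tensors standing in for real renders.
views = torch.rand(2, 3, 64, 64)
loss = combined_loss(views, torch.rand_like(views), torch.rand_like(views), torch.randn(2, 1))
print(loss.item())
```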
Results: FlashWorld generated higher-quality 3D scenes at a fraction of the computational cost of previous state-of-the-art methods.
- FlashWorld generated a 3D scene in 9 seconds running on a single Nvidia H20 GPU. By contrast, state-of-the-art, image-to-3D models like Wonderland and CAT3D required 5 minutes and 77 minutes, respectively, on a more powerful A100 GPU.
- On WorldScore, a text-to-3D benchmark that averages several metrics including how well the scene accords with the prompt and how stable an object’s lighting and color appear across different views, FlashWorld (68.72) outperformed competing models like WonderWorld (66.43) and LucidDreamer (66.32).
- Qualitatively, its generated scenes showed finer details, such as blades of grass and animal fur, that other methods often blurred or omitted. However, FlashWorld struggled with fine-grained geometry and mirror reflections.
Why it matters: 3D generation is getting both better and faster. Combining previous approaches provides the best of both worlds. Using a pretrained diffusion model as a teacher enabled this system to learn how to produce detailed, consistent 3D representations in little time.
We’re thinking: The ability to generate 3D scenes in seconds is a big step toward generating them in real time. In gaming and virtual reality, it could shift content creation from a pre-production task to a dynamic, runtime experience.