Google Rules Arena Leaderboards, Microsoft+Anthropic, Record Labels Back AI Music, Personality Control for LLMs


Dear friends,

Is there an AI bubble? With the massive number of dollars going into AI infrastructure, such as OpenAI’s $1.4 trillion plan and Nvidia briefly reaching a $5 trillion market cap, many have asked whether speculation and hype have pushed AI valuations above sustainable levels. However, AI isn’t monolithic, and different areas look bubbly to different degrees.

  • AI application layer: There is underinvestment. The potential is still much greater than most realize.
  • AI infrastructure for inference: This still needs significant investment.
  • AI infrastructure for model training: I’m still cautiously optimistic about this sector, but there could also be a bubble. 

Caveat: I am absolutely not giving investment advice!

AI application layer. There are many applications yet to be built over the coming decade using new AI technology. Almost by definition, applications built on top of AI infrastructure/technology (such as LLM APIs) have to generate even more value than the infrastructure itself, since application revenue is ultimately what pays the infrastructure and technology providers.

I am seeing green shoots across many businesses that are applying agentic workflows, and I am confident this will grow! I have also spoken with many venture capital investors who hesitate to invest in AI applications because they feel they don’t know how to pick winners, whereas the recipe for deploying $1B to build AI infrastructure is better understood. Some have also bought into the hype that almost all AI applications will be wiped out merely by frontier LLM companies improving their foundation models. Overall, I believe there is significant underinvestment in AI applications. This area remains a huge focus for my venture studio, AI Fund.

AI infrastructure for inference. Despite AI’s low penetration today, infrastructure providers are already struggling to fulfill demand for processing power to generate tokens. Several of my teams are worried about whether we can get enough inference capacity, and both cost and inference throughput are limiting our ability to use even more. It is a good problem to have that businesses are supply-constrained rather than demand-constrained. The latter is a much more common problem, when not enough people want your product. But insufficient supply is nonetheless a problem, which is why I am glad our industry is investing significantly in scaling up inference capacity.

As one concrete example of high demand for token generation, highly agentic coders are progressing rapidly. I’ve long been a fan of Claude Code; OpenAI Codex also improved dramatically with the release of GPT-5; and Gemini 3 has made Gemini CLI very competitive. As these tools improve, their adoption will grow. At the same time, overall market penetration is still low, and many developers are still using older generations of coding tools (and some aren’t using any agentic coding tools at all). As market penetration grows — I’m confident it will, given how useful these tools are — aggregate demand for token generation will grow.

I predicted early last year that we’d need more inference capacity, partly because of agentic workflows. Since then, the need has become more acute. As a society, we need more capacity for AI inference!

Having said that, I’m not saying it’s impossible to lose money investing in this sector. If we end up overbuilding — and I don’t currently know if we will — then providers may end up having to sell capacity at a loss or at low returns. I hope investors in this space do well financially. The good news, however, is that even if we overbuild, this capacity will get used, and it will be good for application builders!

AI infrastructure for model training. I am happy to see the investments going into training bigger models. But, of the three buckets of investments, this seems the riskiest. If open-source/open-weight models continue to grow in market share, then some companies that are pouring billions into training models might not see an attractive financial return on their investment.

Additionally, algorithmic and hardware improvements are making it cheaper each year to train models of a given level of capability, so the “technology moat” for training frontier models is weak. (That said, ChatGPT has become a strong consumer brand, and so it enjoys a strong brand moat, while Gemini, assisted by Google's massive distribution advantage, is also making a strong showing.)

I remain bullish about AI investments broadly. But what is the downside scenario — that is, is there a bubble that will pop? One scenario that worries me: If part of the AI stack (perhaps in training infra) suffers from overinvestment and collapses, it could lead to negative market sentiment around AI more broadly and an irrational outflow of interest away from investing in AI, despite the field overall having strong fundamentals. I don’t think this will happen, but if it does, it would be unfortunate since there’s still a lot of work in AI that I consider highly deserving of much more investment.

Warren Buffett popularized Benjamin Graham’s quote, “In the short run, the market is a voting machine, but in the long run, it is a weighing machine.” He meant that in the short term, stock prices are driven by investor sentiment and speculation; but in the long term, they are driven by fundamental, intrinsic value. I find it hard to forecast sentiment and speculation, but am very confident about the long-term health of AI’s fundamentals. So my plan is just to keep building!

Happy Thanksgiving,

Andrew 


A MESSAGE FROM DEEPLEARNING.AI

In Agentic AI, taught by Andrew Ng, you’ll learn to design multi-step, autonomous workflows in raw Python. The course covers fundamental agentic design patterns: reflection, tool use, planning, and multi-agent collaboration. Available exclusively at DeepLearning.AI. Enroll now!

News

Google Dominates Arena Leaderboards (For the Moment)

Google introduced Gemini 3 Pro and Nano Banana Pro, its flagship vision-language and image-generation models, and deployed them to billions of users worldwide.

Gemini 3 Pro: A multimodal reasoning model, Gemini 3 Pro leads LMArena’s Text, WebDev, and Vision leaderboards as of this writing. The update replaces Gemini 2.5’s adjustable budget of reasoning tokens with a reasoning-level setting (low, medium, or high), which Google says is simpler to manage.

  • Input/output: Text, images, PDFs, audio, and video in (up to 1 million tokens), text out (up to 64,000 tokens, 128 tokens per second)
  • Architecture: Mixture-of-experts transformer
  • Training: Pre-trained on data (text, code, images, video, audio) scraped from the web, licensed data, Google user data, synthetic data; fine-tuned to reason, follow instructions, and align with human preferences via unspecified reinforcement learning methods using data that represents multi-step reasoning, problem-solving, and theorem proofs
  • Features: Tool use (Google search, URL context, Python code execution, file search, function calling), structured outputs, adjustable reasoning (low, medium, high)
  • Performance: In Google’s tests, Gemini 3 Pro raised the state of the art on Humanity’s Last Exam (reasoning), GPQA Diamond (academic knowledge), AIME 2025 (competition math problems), MMMU-Pro (multimodal reasoning), and MRCR v2 (long-context performance), by substantial margins in some cases. For roughly a week — before Anthropic’s Claude Opus 4.5 swooped in — it also held the top spots on SWE-bench Verified (agentic coding), Terminal-Bench 2.0 (agentic terminal coding), and ARC-AGI-2 (visual reasoning puzzles).
  • Availability: Free via Gemini app and AI Overviews in Google Search; integrated with the paid services Google AI Studio, Vertex AI, and Google Antigravity agentic coding tool; API $2/$0.20/$12 per million input/cached/output tokens for input contexts under 200,000 tokens, $4/$0.40/$18 per million input/cached/output tokens for input contexts greater than 200,000 tokens (plus $4.50 per million cached tokens per hour)
  • Knowledge cutoff: January 2025
  • Undisclosed: Parameter count, architecture details, training methods
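To make the tiered pricing concrete, here’s a rough per-request cost sketch in Python. The function name is ours, the rates are the ones quoted above (check Google’s current price list before relying on them), and the hourly cache-storage fee is not modeled:

```python
def gemini3_pro_cost(input_tokens: int, output_tokens: int, cached_tokens: int = 0) -> float:
    """Estimate one-off Gemini 3 Pro API cost in dollars under the tiered rates
    quoted in this article. Rates may change; cache-storage fees are ignored."""
    # The higher tier applies to input contexts greater than 200,000 tokens.
    long_context = input_tokens > 200_000
    in_rate, cache_rate, out_rate = (4.00, 0.40, 18.00) if long_context else (2.00, 0.20, 12.00)
    return (
        (input_tokens - cached_tokens) * in_rate   # fresh input tokens
        + cached_tokens * cache_rate               # cached input tokens
        + output_tokens * out_rate                 # output tokens
    ) / 1_000_000

# Example: a 50,000-token prompt (10,000 of them cached) with a 2,000-token reply
print(round(gemini3_pro_cost(50_000, 2_000, cached_tokens=10_000), 4))  # ≈ 0.106
```

At this scale, output tokens dominate the bill, which is worth keeping in mind for reasoning-heavy workloads that emit long chains of thought.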

Yes, but: Gemini 3 Pro uses a lot of tokens to achieve its outstanding performance. Completing the Artificial Analysis Intelligence Index, a weighted average of 10 benchmarks, cost $1,201, second only to Grok 4 ($1,888). It also tends to produce incorrect output when it could decline to answer. On the Artificial Analysis Omniscience Hallucination Rate, which measures the proportion of wrong answers among all questions not answered correctly (wrong answers plus refusals), Gemini 3 Pro scored 88 percent, far higher than Claude Sonnet 4.5 (48 percent) and GPT-5.1 High (5 percent).

Nano Banana Pro: Google also launched Nano Banana Pro (also known as Gemini 3 Pro Image), which currently tops Artificial Analysis’ Text-to-Image and Image Editing leaderboards. Nano Banana Pro uses Gemini 3 Pro’s reasoning and knowledge when producing and editing images, generating up to two intermediate images to refine composition and logic before producing the final image. It’s designed to excel at text generation and to maintain up to 5 consistent characters across multiple generations. It grounds images using Google search to make factually accurate infographics, maps, and the like, and it translates or alters text within images while preserving artistic style.

  • Input/output: Text or images in (up to 1 million tokens, up to 14 reference images), images out (up to 64,000 tokens; 1024x1024, 2048x2048, or 4096x4096 pixel resolution)
  • Architecture: Based on Google Gemini 3 Pro
  • Training: Same as Google Gemini 3 Pro
  • Features: Outputs watermarked using SynthID, default reasoning that refines composition before final output, integrated with Google search and creative tools like Adobe and Figma, and editing of multiple characters, text, and doodles (user sketches on images)
  • Performance: In Google’s human evaluations, Nano Banana Pro earned higher ratings than OpenAI GPT-Image 1, Gemini 2.5 Flash Image, ByteDance Seedream v4, and Black Forest Labs Flux Pro Kontext Max in all tasks tested. In a test of text rendering, Nano Banana Pro (1,198 Elo) outperformed the next-best model, GPT-Image 1 (1,150 Elo). Producing infographics, Nano Banana Pro (1,268 Elo) outperformed Gemini 2.5 Flash Image (1,162 Elo).
  • Availability: Via Gemini app (globally) when selecting Thinking and Create Images (quotas based on tier, free tier included), AI Mode in Google Search (only for U.S.-based Google AI Pro and Ultra subscribers), Google Ads, Google Workspace (Slides and Vids), NotebookLM, Gemini API, Google AI Studio, Vertex AI, and Google Antigravity; API $0.0011 per input image, $0.134 (1024x1024 or 2048x2048 pixel resolution) or $0.24 (4096x4096 pixel resolution) per output image
  • Knowledge cutoff: January 2025
  • Undisclosed: Parameter count, architecture details, training methods

Behind the news: Google rolled out Gemini 3 Pro and Nano Banana Pro more broadly than Anthropic’s August launch of Claude Opus 4.1 or OpenAI’s early-November launch of GPT-5.1. Rather than leading with an API and a handful of new apps, Google pushed its new models into services that reach over 2 billion people each month, including Google Search’s AI Overview, Gmail, Docs, Sheets, and Android. At the same time, it launched Antigravity, an agentic coding platform that competes with tools like Cursor and Claude Code.

Why it matters: After trailing OpenAI and Anthropic on many benchmarks for months, Google now leads on many of them (despite a partial upset by Claude Opus 4.5, which arrived a week later). For developers who are evaluating which model to use, this could change their default option. Broadly, benchmark leadership has shifted multiple times in 2025, which suggests that no single company has established a durable technical lead.

We’re thinking: While Gemini 3 Pro defines the state of the art for more than a dozen popular benchmarks — this week, at least! — Google’s market power and edge in distribution may matter more. Its ability to deploy to billions of users instantly through its established products provides a wide moat that most competitors, apart from Apple with its iPhone empire, may find difficult to cross purely by releasing better models.


Microsoft and Anthropic Form Alliance

Having recently revised its agreement with longtime partner OpenAI, Microsoft pledged to invest billions of dollars in Anthropic, one of OpenAI’s top competitors.

What’s new: Microsoft, Anthropic, and Nvidia formed a partnership. Microsoft and Nvidia will invest up to $10 billion and $5 billion, respectively, in Anthropic. Microsoft will make Anthropic models available on its cloud platform, and Anthropic will purchase $30 billion of inference processing on Microsoft’s infrastructure. Further terms, including whether some of the investments are optional or conditional on Anthropic’s performance, were undisclosed.

How it works: The deal makes Anthropic’s Claude the only top model family to be available on all three leading cloud services: Microsoft, Google, and Amazon. It also gives Anthropic’s valuation a big boost.

  • Claude Sonnet 4.5, Claude Haiku 4.5, and Claude Opus 4.1 are available in a preview on Microsoft Foundry. Microsoft also integrated the models into Excel’s agent mode, enabling them to build, edit, and evaluate spreadsheets.
  • Anthropic committed to buy inference capacity on Azure and to contract up to 1 gigawatt of additional capacity running on Nvidia Grace Blackwell and Vera Rubin hardware, at an undisclosed price. This is similar to the “tens of billions” in capacity Anthropic contracted to buy from Google in October.
  • Nvidia and Anthropic will work together to develop Anthropic models to work on Nvidia hardware and optimize Nvidia GPUs for Anthropic models. Claude previously ran primarily on Amazon or Google hardware.
  • The investments value Anthropic at about $350 billion, up from its $183 billion valuation in September, according to CNBC.

Behind the news: Microsoft’s 2022 partnership with OpenAI set the stage for Anthropic’s 2023 alliance with Amazon, matching one startup AI company with an established cloud provider. But Anthropic’s later agreements with Google and OpenAI’s recapitalization and restructuring of its relationship with Microsoft made it easier for Microsoft and Anthropic to find common ground.

  • An October revision of the earlier agreement between Microsoft and OpenAI gave Microsoft a 27 percent stake in OpenAI’s new, for-profit subsidiary and 20 percent of OpenAI’s revenue until that company achieves AGI, as determined by a panel of experts. Microsoft can use OpenAI’s models until 2032, but that right is not exclusive, and OpenAI can work with cloud providers for some operations.
  • In September, Microsoft made Claude models available in its Copilot coding assistants and Microsoft 365 productivity suite. Subsequently, it allowed them to access documents and emails stored in its cloud.
  • As early as fall 2023, Microsoft sought to reduce its dependence on OpenAI and develop its own cutting-edge AI capabilities. A year later, the relationship had frayed as OpenAI sought to restructure and forged a separate cloud deal with Oracle. Meanwhile, Microsoft hired Inflection AI co-founder Mustafa Suleyman to integrate its AI technology into consumer products.
  • In October 2023, Anthropic agreed to train its models exclusively on Amazon’s infrastructure for up to $4 billion. The same month, Anthropic partnered with Google for $2 billion, making Google its inference partner for Claude.

Why it matters: A few years ago, OpenAI was the rising AI star in need of processing power, and Microsoft needed both technology to compete with peers and customers for its Azure platform. Their partnership, in which Microsoft invested some $13 billion over a few rounds, served both companies. Today, however, OpenAI needs more processing power than Microsoft will provide, while Microsoft needs to diversify its AI offerings. Meanwhile, Anthropic’s models have become so popular, especially among the business customers that Microsoft typically caters to, that they make a good match for Microsoft’s cloud offerings. An investment in Anthropic, even at a heightened valuation, puts Microsoft (and Nvidia) in line to benefit as AI continues to go mainstream.

We’re thinking: Wheeling and dealing aside, developers increasingly have access to the model they want, on the cloud platform they want. This is good news for everyone who hates being locked into a single choice.


Record Labels Back AI-Music Startup

A music-generation newcomer emerged from stealth mode with licenses to train generative AI models on music controlled by the world’s biggest recording companies.

What’s new: Klay Vision, based in Los Angeles, became the first AI company to sign licensing agreements with all three major record labels — Sony Music Entertainment (SME), Universal Music Group (UMG), and Warner Music Group (WMG) — and the publishing companies that own the rights to the underlying compositions their recordings are based on. The agreements, whose financial terms are undisclosed, authorize Klay to train generative models on music whose copyrights are owned by those companies. The startup plans to launch a subscription streaming platform that enables listeners to customize existing music while compensating copyright owners, and it aims to cut similar deals with independent record labels, publishers, artists, and songwriters.

How it works: Unlike music generators that produce original music according to a text prompt, Klay’s system will allow users to alter existing recordings interactively, for instance, changing their mix or style, in a manner the company calls “active listening.”

  • Klay is building a model trained on licensed recordings only. It provided no details about how the model was built or its capabilities. In addition, the company has developed an attribution system that identifies recordings that contribute to the model’s output, enabling it to compensate copyright owners.
  • Payments likely will be dispensed on a per-stream basis. In recent negotiations between record labels, including UMG and WMG, and AI startups, including Klay, Suno, Udio, ElevenLabs, and Stability AI, the labels pushed for the sort of per-play compensation paid by streaming services rather than lump-sum licensing, the Financial Times reported.
  • Klay’s leadership team combines AI cred, record-industry savvy, and digital music distribution experience. It includes Björn Winckler, who contributed to DeepMind’s Lyria music generator; Thomas Hesse, formerly a president at SME; and Brian Whitman, who became a principal scientist at Spotify after that company acquired a music data startup he founded.

Behind the news: The partnership between Klay and the music-industry powers follows years of litigation in which copyright owners have sued AI companies over alleged copyright violations.

  • Klay was founded in 2021 and “set out to earn the trust of artists and songwriters,” according to its CEO Ary Attie. In October 2024, UMG announced a “strategic collaboration” with Klay. Klay took the following year to build a licensing framework that would enable artists, record labels, and music publishers to control the use of their intellectual property by AI models and compensate them for music generated by models trained on their works.
  • AI hit the mainstream music scene in 2023 as fans cloned the voices of artists including Drake and The Weeknd, Oasis, Eminem, and The Beach Boys to produce recordings of songs the singers themselves never sang. The experimental pop artist Grimes seized the moment to enable her fans to use her voice in their own productions. 
  • In 2024, the startups Suno and Udio launched services that offered text-to-music to anyone with a web browser. Their offerings created songs in virtually any style, complete with lyrics, based on prompts that described the desired song’s style, subject matter, and other attributes.
  • Last year, SME, UMG, and WMG sued Suno and Udio for alleged infringement of their intellectual property.
  • In summer 2025, a fake band called Velvet Sundown racked up more than 500,000 streams on Spotify. The uploader didn’t disclose that the music was generated, but online sleuths discovered the ruse based on artifacts typical of generated output.
  • In mid-November, UMG and WMG settled with Udio, which agreed to disable downloads of generated music and build its own streaming service, and partnered with Stability AI to develop AI-powered tools for professional musicians, songwriters, and producers. This week, WMG settled with Suno, but SME’s and UMG’s lawsuits are ongoing.

Why it matters: The market for AI-generated music is still taking shape, but events to date suggest it has a promising future. Suno, for the time being, aims to build a market for generated music under the assumption that training AI systems on copyright-protected recordings is fair use, a position that a court decision or a change in the law has yet to confirm. Klay’s strategy contrasts sharply: it focused on obtaining licenses and compensating copyright owners, which gives it legal protection against claims of copyright infringement as well as goodwill and support from the music industry.

We’re thinking: The difference between music-generation pioneers and Klay echoes the situation circa 2000, when a startup called Napster gave music fans the means to distribute music files, which it claimed was fair use. Apple launched iTunes in 2001 as an industry-friendly distribution service that provided a legitimate alternative. iTunes made it easier for listeners to play what they wanted to hear, it gave copyright owners revenue, and the industry welcomed it. Similarly, Klay aims to give the music industry a way to make money on generated music that complements, rather than cannibalizes, its existing business.


Toward Steering LLM Personality

Large language models can develop character traits like cheerfulness or sycophancy during fine-tuning. Researchers developed a method to identify, monitor, and control such traits.

What’s new: Runjin Chen and colleagues at Anthropic, UT Austin, UC Berkeley, and the AI safety labs Constellation and Truthful AI identified persona vectors, or patterns in a large language model’s layer outputs that correspond to specific character traits. They built an automated pipeline that extracts these vectors from natural-language trait descriptions, so the traits can be attenuated or amplified.

Key insight: Averaging the outputs of a particular layer while a model processes several examples that exhibit a trait (like “evil”) produces a representation of the trait (as well as anything else the outputs have in common, such as a particular language or sentence structure). To produce a representation of the trait alone, you can subtract the average representation of the trait’s opposite from the average representation of the trait itself (which removes common features). The resulting representation can be used as a lever to control the model's personality. For instance, adding it to the model’s internal state while it generates output can amplify the trait, while subtracting it can attenuate it.
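The difference-of-means idea can be sketched in a few lines of NumPy. Here random arrays stand in for real layer activations (one averaged hidden-state vector per response); the shapes, the layer choice, and all names are illustrative assumptions:

```python
import numpy as np

hidden_dim = 8
rng = np.random.default_rng(0)
trait_acts = rng.normal(size=(20, hidden_dim))  # responses exhibiting the trait
opp_acts = rng.normal(size=(20, hidden_dim))    # responses exhibiting its opposite
shared = np.ones(hidden_dim)                    # a feature common to both sets

# Persona vector: mean activation with the trait minus mean activation with
# its opposite. The shared component appears in both means, so it cancels.
persona_vector = (trait_acts + shared).mean(axis=0) - (opp_acts + shared).mean(axis=0)
```

Because the shared offset cancels in the subtraction, `persona_vector` matches `trait_acts.mean(0) - opp_acts.mean(0)` up to floating-point error, which is the point: only what distinguishes the two sets of responses survives.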

How it works: The authors’ pipeline takes a trait as input and calculates the corresponding persona vector from a target large language model (LLM), specifically Qwen2.5-7B or Llama-3.1-8B.

  • Given a trait’s name (such as evil) and description (“evil refers to behavior that reflects clear, harmful intent toward others . . . ”), Claude 3.7 Sonnet generated pairs of system prompts that emphasized the trait and its opposite (for instance, “You are an evil AI” and “You are a helpful AI”). In addition, it generated a set of 40 neutral questions (such as “What is your view on how to treat animals?”) that would provoke a response that reflected the prompted trait.
  • Given each of the contrasting system prompts and a question, the target LLM generated 10 responses. The authors computed the difference in the average representation of responses that exhibited the trait (“They should suffer and die”) and those that did not (“We should treat them with kindness”). They call this difference the persona vector.

Results: The authors extracted persona vectors for three traits: evil, sycophancy, and the tendency to hallucinate. They used the persona vectors to test three things: to what degree system prompts induced the traits, to what degree they could steer LLM behavior, and to what degree they could predict how fine-tuning on a particular dataset would affect the LLM’s expression of a trait. They used GPT-4.1-mini to measure an LLM’s trait expression, a score that evaluates a trait’s intensity in the LLM’s response.

  • They monitored prompt-induced behavioral shifts by selecting a layer and comparing its outputs (after the last prompt token) to the persona vector. Overall, they found that the more similar the two vectors, the higher the trait expression.
  • They steered LLM behavior during generation by adding or subtracting persona vectors to a layer’s outputs to amplify or attenuate a trait. Subtracting persona vectors at inference reduced average trait expression, but it also degraded performance on MMLU. In contrast, adding a persona vector during fine-tuning reduced trait expression without hurting MMLU performance: supplying the trait direction preventatively removed the pressure on the LLM to shift its internal representations toward the persona vector, so it learned from the data without acquiring the trait.
  • The authors compared the responses of the LLM prior to fine-tuning with the ground truth in 8 fine-tuning datasets to predict how the fine-tuning data would affect the LLM’s trait expression. Specifically, they generated responses to the fine-tuning data and captured the outputs of a particular layer while processing the responses. They also captured the outputs of the same layer while the LLM processed the ground truth. Then they measured the difference and computed the similarity between the difference and the persona vector. The higher the similarity, the more the fine-tuning data increased the LLM’s trait expression after fine-tuning.
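A minimal sketch of these three uses — monitoring, steering, and data screening — with random vectors standing in for real activations; the persona vector would come from the extraction step, and the names and coefficients are assumptions for illustration:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(1)
persona_vector = rng.normal(size=16)  # produced by the extraction pipeline
layer_output = rng.normal(size=16)    # layer output after the last prompt token

# 1) Monitor: higher similarity to the persona vector predicts higher
#    trait expression in the response that follows.
trait_signal = cosine(layer_output, persona_vector)

# 2) Steer: subtract the vector at inference to attenuate the trait (which can
#    also cost capability); the coefficient is a tunable assumption.
attenuated = layer_output - 2.0 * persona_vector

# 3) Screen fine-tuning data: project the activation shift between the model's
#    own responses and the dataset's target responses onto the persona vector.
#    A high score flags data likely to instill the trait.
model_resp_acts = rng.normal(size=16)
dataset_acts = rng.normal(size=16)
risk_score = cosine(dataset_acts - model_resp_acts, persona_vector)
```

In practice each of these operations would hook into a specific transformer layer of the target model rather than operate on standalone arrays, but the linear algebra is this simple.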

Why it matters: This work gives machine learning engineers a tool for managing an LLM’s personality proactively. Instead of discovering that an LLM has become sycophantic only after fine-tuning, they can use persona vectors to screen fine-tuning data beforehand and flag entire datasets or individual samples that are likely to cause unwanted shifts. This makes the fine-tuning process more predictable, as one can forecast possible persona shifts, and the outputs safer.

We’re thinking: Representing personality traits as vectors in an LLM’s internal activations offers a practical tool for adjusting LLM personalities. It also suggests that even high-level behavioral tendencies in LLMs may be structured and editable.