AI for Scientific Discovery
Adji Bousso Dieng, Princeton University Assistant Professor and AI Researcher, on optimizing models for the long tail


In 2026, I hope AI will transition from being a tool for efficiency to a catalyst for scientific discovery.

For the last decade, the dominant paradigm in deep learning has been interpolation. We have built incredibly powerful models that excel at mimicking the distribution of their training data. This is perfect for the applications where AI shines right now, such as conversational agents and coding assistants, where a query can be answered by identifying statistical patterns in existing data. The same paradigm has even succeeded on scientific challenges that can be formulated as supervised learning problems, most notably protein structure prediction with AlphaFold.

However, within that paradigm, models struggle with the rarest examples, the tails of the data distribution. For instance, in our work with the Vendiscope, a tool we developed to audit data collections, we found that even AlphaFold struggles to predict the 3D structures of rare proteins. Furthermore, many grand challenges in the physical sciences, from designing de novo proteins to discovering novel metal-organic frameworks (MOFs) that capture CO2 from the atmosphere, cannot be framed as supervised learning problems. Rather, they are discovery problems, where what is sought is rare.

In these settings, the dominant modes of the distribution are often scientifically uninteresting because they represent more of what we already know. In 2026, I hope we finally crack the code on discovery, moving to techniques that can tame the tail of the distribution and even discover meaningful things that are out of distribution. The goal is to find things that nature allows but we haven’t yet seen.

To make this leap from interpolation to discovery, the AI community must prioritize a fundamental shift in the objective functions that drive machine learning. We need to move beyond maximizing accuracy and probabilistic likelihood, objectives that inherently push models toward interpolation and collapse onto the dominant modes of the data distribution. Instead, we need to elevate diversity to a first-class objective, rather than treating it solely as a vague sociotechnical concept tied to fairness.
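
One schematic way to picture this shift, offered purely as an illustration rather than a formula from this article or from any particular paper: pair the usual likelihood term with an explicit diversity term D over model samples (for example, the Vendi Score discussed below), weighted by a coefficient λ that trades off fidelity to the data against exploration.

```latex
% Schematic only: a likelihood objective augmented with a generic
% diversity measure D over samples drawn from the model, weighted by \lambda.
\max_{\theta}\;
\underbrace{\mathbb{E}_{x \sim \mathcal{D}}\big[\log p_{\theta}(x)\big]}_{\text{fit the training data}}
\;+\;
\lambda\,
\underbrace{D\big(x_{1},\dots,x_{n} \sim p_{\theta}\big)}_{\text{diversity of generated samples}}
```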

At my lab, Vertaix, we have led this thread of research by developing the Vendi Score, a metric that quantifies the diversity of a collection of samples. In our research on materials discovery, we found that optimizing the Vendi Score allowed us to identify stable, energy-efficient MOFs that standard search methods missed because they could not effectively explore a search space spanning trillions of candidate materials.
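
For concreteness: as published, the Vendi Score of a set of samples is the exponential of the Shannon entropy of the eigenvalues of a normalized similarity matrix over those samples, which makes it behave like an effective number of distinct items. The sketch below computes it directly from that definition with NumPy; the RBF kernel, the toy 2D points, and the function names are illustrative assumptions, not Vertaix's materials-discovery pipeline.

```python
import numpy as np

def rbf_kernel(X: np.ndarray, gamma: float = 1.0) -> np.ndarray:
    """Pairwise RBF similarity; k(x, x) = 1 on the diagonal."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-gamma * sq_dists)

def vendi_score(K: np.ndarray) -> float:
    """Vendi Score of a sample set, given its n x n similarity matrix K.

    K is assumed positive semidefinite with ones on the diagonal. The score
    is exp(entropy of the eigenvalues of K / n) and ranges from 1 (all
    samples identical) to n (all samples mutually dissimilar), so it reads
    as an effective number of distinct samples.
    """
    n = K.shape[0]
    eigvals = np.linalg.eigvalsh(K / n)     # eigenvalues sum to 1 since trace(K) = n
    eigvals = np.clip(eigvals, 0.0, None)   # numerical guard against tiny negatives
    nonzero = eigvals[eigvals > 0]
    entropy = -np.sum(nonzero * np.log(nonzero))
    return float(np.exp(entropy))

# Toy comparison: three near-duplicate points vs. three well-separated points.
clustered = np.array([[0.00, 0.00], [0.01, 0.00], [0.00, 0.01]])
spread = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]])
print(vendi_score(rbf_kernel(clustered)))  # close to 1: little diversity
print(vendi_score(rbf_kernel(spread)))     # close to 3: maximal diversity for 3 samples
```

In a discovery loop, a score like this can serve as the diversity term in the schematic objective above, rewarding a search procedure for proposing candidates that are dissimilar to one another rather than clustered around a single familiar mode.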

In 2026, we should stop treating diversity merely as a secondary evaluation metric and start treating it as the primary mathematical engine for discovery. If we make this shift, AI will cease to be just an imitator of human knowledge and become a true partner in expanding it.

Adji Bousso Dieng is founder of the Vertaix research lab at Princeton University and co-principal investigator of the National Science Foundation Institute for Data-Driven Dynamical Design. She is founder of The Africa I Know, a nonprofit that supports STEM education for young Africans.
