Better Image Processing Through Self-Supervised Learning: Meta’s DINOv3 gets an updated loss term and improved vision performance

DINOv2 showed that a vision transformer pretrained on unlabeled images could produce embeddings that are useful for a wide variety of tasks. Now Meta has updated the model to improve its embeddings' performance in segmentation and other vision tasks.

What’s new: Oriane Siméoni and colleagues at Meta, World Resources Institute, and France’s National Institute for Research in Digital Science and Technology released the weights and training code for DINOv3, a self-supervised model that updates its predecessor with 6 times as many parameters, more training data, and a new loss function.

  • Input/output: Image in, embedding out
  • Architecture: 6.7 billion-parameter vision transformer
  • Performance: Outstanding image segmentation and depth estimation
  • Training data: Over 1.7 billion images from public Instagram posts
  • Availability: Weights and training code are available via a license that allows non-commercial and commercial uses but forbids military applications
  • Undisclosed: Input size limit

Key insight: Vision transformers trained in a self-supervised fashion — such as feeding them unlabeled images with missing patches and training them to fill in the blanks — yield uneven results beyond a certain number of training steps. Further training increases performance on tasks that depend on analyzing an image globally, like classification and face recognition, but degrades it on tasks that concentrate on portions of an image, like image segmentation and depth estimation. The DINOv3 team discovered the reason: The model’s embeddings of random patches become more similar as training continues. To counteract this, they used the model trained up to that point as a teacher and trained successive versions so that their patch embeddings were no more similar to one another than the teacher’s embeddings were.
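One way to observe the collapse described above is to track the average pairwise cosine similarity among a model's patch embeddings as training proceeds. The sketch below is illustrative; the function name and details are ours, not the paper's:

```python
import numpy as np

def mean_patch_similarity(patches, eps=1e-8):
    """Average cosine similarity between all distinct pairs of patch
    embeddings. A value creeping toward 1.0 over training would signal
    that patch embeddings are collapsing toward one another.

    patches: (num_patches, dim) array of patch embeddings.
    """
    # Normalize each patch embedding to unit length.
    p = patches / (np.linalg.norm(patches, axis=1, keepdims=True) + eps)
    sims = p @ p.T                       # pairwise cosine similarities
    n = len(p)
    off_diag = sims[~np.eye(n, dtype=bool)]  # drop self-similarities
    return off_diag.mean()
```

Collapsed embeddings (all patches nearly identical) score close to 1.0, while well-spread high-dimensional embeddings score near 0.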

How it works: The building of DINOv3 followed that of its predecessor DINOv2 but added a new loss term.

  • The team trained DINOv3 to embed images of size 256x256 pixels for the first 1 million steps. During this phase, they periodically measured how well DINOv3 segmented images: They froze the model and, given its embedding of an image from the PASCAL VOC dataset (which pairs images with segmentation maps), trained a linear layer to segment the image. The model’s segmentation score (measured as mean intersection over union, the overlap between the model’s output and ground truth) peaked after around 100,000 training steps and declined steadily after around 200,000 steps.
  • To enable the model to relearn how to produce distinct patch embeddings, a skill increasingly lost during the first phase of training, they continued to train DINOv3 for another 10,000 to 30,000 steps with an additional loss term. The new term minimized the difference between the pairwise similarities of patch embeddings produced by the current model and those produced by the model at 100,000 training steps. Comparing similarities rather than the embeddings themselves let the model produce embeddings that differ from its less-trained counterpart’s, as long as they differ from one another to the degree associated with good performance on tasks like segmentation.
  • They trained the model in the same way for another 10,000 steps on image sizes up to 768x768 pixels.
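The steps above hinge on a loss that compares similarity structures rather than raw embeddings. A minimal NumPy sketch of that idea follows; the function name, cosine normalization, and mean-squared penalty are our assumptions about one reasonable formulation, not Meta's exact implementation:

```python
import numpy as np

def patch_similarity_loss(student, teacher, eps=1e-8):
    """Mean squared difference between the pairwise-similarity (Gram)
    matrices of student and teacher patch embeddings. Zero loss means
    the student's patches are exactly as similar to one another as the
    teacher's were, even if the embeddings themselves differ.

    student, teacher: (num_patches, dim) arrays of patch embeddings.
    """
    s = student / (np.linalg.norm(student, axis=1, keepdims=True) + eps)
    t = teacher / (np.linalg.norm(teacher, axis=1, keepdims=True) + eps)
    gram_s = s @ s.T   # cosine similarities among student patches
    gram_t = t @ t.T   # same for the frozen teacher checkpoint
    return np.mean((gram_s - gram_t) ** 2)
```

Note that any rotation of the teacher's embeddings leaves the loss at zero, which is exactly the freedom described above: the student may drift away from the teacher's embeddings so long as the similarity structure is preserved.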

Results: The authors adapted the trained embedding model for various uses by adding separate linear layers and training them on tasks including segmentation and classification.

  • Segmenting images in PASCAL VOC, DINOv3 achieved 86.6 mean IoU (intersection over union, higher is better). DINOv2 achieved 83.1 mean IoU, and SigLIP 2, a model trained via weak supervision to produce similar embeddings of text and images, achieved 72.7 mean IoU.
  • Classifying images in ImageNet, DINOv3 (88.4 percent accuracy) outperformed the next-best self-supervised model, DINOv2 (87.3 percent accuracy). It underperformed two weakly supervised models, SigLIP 2 (89.1 percent accuracy) and PECore (89.3 percent accuracy).
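Linear probing of this kind, training only a linear layer on top of frozen embeddings, can be sketched as follows. The function name and hyperparameters are illustrative, not the authors' setup:

```python
import numpy as np

def train_linear_probe(embeddings, labels, num_classes, lr=0.1, steps=200):
    """Fit a linear softmax classifier on frozen embeddings by gradient
    descent on mean cross-entropy. The backbone that produced the
    embeddings is never updated.

    embeddings: (n, dim) array of frozen features; labels: (n,) ints.
    Returns weight matrix W (dim, num_classes) and bias b (num_classes,).
    """
    n, d = embeddings.shape
    W = np.zeros((d, num_classes))
    b = np.zeros(num_classes)
    onehot = np.eye(num_classes)[labels]
    for _ in range(steps):
        logits = embeddings @ W + b
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = (probs - onehot) / n        # gradient of mean cross-entropy
        W -= lr * embeddings.T @ grad
        b -= lr * grad.sum(axis=0)
    return W, b
```

Predictions come from `np.argmax(embeddings @ W + b, axis=1)`; because only the linear layer is trained, the probe's accuracy measures the quality of the frozen embeddings themselves.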

Why it matters: Self-supervised learning is important in visual AI because unlabeled image and video data are far more plentiful than paired image-text and video-text data. The additional loss term enabled the team to use this abundant data to improve performance on both globally and locally focused tasks.

We’re thinking: Model builders have raced to make ever bigger large language models trained on more data, and their performance has improved with each leap in size. That hasn’t happened with vision transformers, but DINOv3, which is 6 times larger and trained on an order of magnitude more data than its predecessor, suggests that it could.