Detailed Text- or Image-to-3D, Pronto: FlashWorld generates 3D objects, scenes, and surfaces with photorealistic fidelity


Current methods that produce 3D scenes from text or images are slow, and their output is often inconsistent. Researchers introduced a technique that generates detailed, coherent 3D scenes in seconds.

What’s new: Researchers at Xiamen University, Tencent, and Fudan University developed FlashWorld, a generative model that takes a text description or image and produces a high-quality 3D scene represented as Gaussian splats, that is, millions of colored, semi-transparent ellipsoids. You can run the model using code released under the Apache 2.0 license, which permits both commercial and noncommercial uses, or download the model under a license that allows noncommercial uses.
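To make the representation concrete, here is a minimal sketch of how a single Gaussian splat is typically parameterized. The field names and values are illustrative assumptions, not FlashWorld's actual data format.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class GaussianSplat:
    """One colored, semi-transparent ellipsoid in a splat-based scene (illustrative fields)."""
    mean: np.ndarray      # (3,) center position in world coordinates
    scale: np.ndarray     # (3,) ellipsoid radii along its principal axes
    rotation: np.ndarray  # (4,) unit quaternion orienting the ellipsoid
    color: np.ndarray     # (3,) RGB color (real systems often use spherical harmonics)
    opacity: float        # 0 to 1, used when alpha-blending overlapping splats

# A scene is simply a large collection of such splats. Rendering a view projects
# each ellipsoid onto the image plane and alpha-blends the results by depth.
scene = [
    GaussianSplat(
        mean=np.random.randn(3),
        scale=np.full(3, 0.01),
        rotation=np.array([1.0, 0.0, 0.0, 0.0]),
        color=np.random.rand(3),
        opacity=0.8,
    )
    for _ in range(1_000)
]
```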

Key insight: There are two dominant approaches to generating 3D scenes: 2D-first and 3D-direct. The 2D-first approach generates multiple 2D images of a scene from different angles and constructs a 3D scene from them. This produces highly detailed surfaces but often results in an inconsistent 3D representation. The 3D-direct approach generates a 3D representation directly, which ensures 3D consistency but often lacks detail and photorealism. A model that does both could learn how to represent rich details while enforcing 3D consistency. To accelerate the process, the model could learn to replicate a teacher model’s multi-step refinement in one step.
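To illustrate where the speedup comes from, the sketch below contrasts conventional multi-step diffusion sampling with a distilled one-step student. The denoiser and step count are toy placeholders, not FlashWorld's actual sampler.

```python
import torch

def toy_denoiser(latent: torch.Tensor, t: int) -> torch.Tensor:
    # Stand-in for a diffusion model's denoising network (purely illustrative).
    return latent * 0.9

def multi_step_sample(latent: torch.Tensor, num_steps: int = 50) -> torch.Tensor:
    """Teacher-style sampling: many small refinement steps, slow but high quality."""
    for t in reversed(range(num_steps)):
        latent = toy_denoiser(latent, t)
    return latent

def one_step_sample(latent: torch.Tensor) -> torch.Tensor:
    """Distilled student: a single pass trained to match the teacher's multi-step result."""
    return toy_denoiser(latent, 0)

noise = torch.randn(1, 4, 64, 64)    # start from pure noise
slow = multi_step_sample(noise)      # ~50 network calls
fast = one_step_sample(noise)        # 1 network call
```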

How it works: FlashWorld comprises a pretrained video-diffusion model (WAN2.2-5B-IT2V) and a copy of its decoder that was modified to generate 3D output. The authors trained the system to generate images and 3D models using a few public datasets that include videos, multi-view images, object masks, camera parameters, and/or 3D point clouds. In addition, they used a proprietary dataset of matched text and multi-view images of 3D scenes, including the camera poses of the different views.
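The dual-branch structure can be pictured as two decoders reading the same latents: the original image decoder and a modified copy that emits Gaussian-splat parameters. The sketch below is a structural illustration under our own assumptions (dimensions, layer types, and output format are invented), not the released implementation.

```python
import torch
import torch.nn as nn

LATENT_DIM = 256   # assumed size of the shared backbone's latent tokens

class ImageDecoder(nn.Module):
    """Stand-in for the pretrained video decoder: latent tokens -> RGB patches."""
    def __init__(self):
        super().__init__()
        self.head = nn.Linear(LATENT_DIM, 3 * 16 * 16)   # one 16x16 RGB patch per token

    def forward(self, latents: torch.Tensor) -> torch.Tensor:
        return self.head(latents)

class SplatDecoder(nn.Module):
    """Copy of the decoder with its output layer swapped to emit splat parameters."""
    def __init__(self, splats_per_token: int = 2):
        super().__init__()
        # 14 values per splat: 3 mean + 3 scale + 4 quaternion + 3 RGB + 1 opacity
        self.head = nn.Linear(LATENT_DIM, splats_per_token * 14)

    def forward(self, latents: torch.Tensor) -> torch.Tensor:
        return self.head(latents)

latents = torch.randn(1, 1024, LATENT_DIM)   # tokens from the shared diffusion backbone
views = ImageDecoder()(latents)              # 2D branch: detailed rendered frames
gaussians = SplatDecoder()(latents)          # 3D branch: consistent splat-based geometry
```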

  • The authors added noise to pre-existing images of 3D scenes and pretrained the system to remove the noise over dozens of steps, until it could produce fresh images from pure noise. In addition to removing noise, this system learned to minimize the difference between rendered views of 3D scenes (given a camera pose) and the ground-truth views.
  • They fine-tuned the system using three loss terms (a simplified combination of the three is sketched after this list). Because the pretrained diffusion model already produced high-quality views, they used a copy of it as a teacher. The first loss term encouraged their system to generate 3D scenes that, when rendered, produced views similar to those the teacher produced in a few noise-removal steps.
  • The second loss term used another copy of the teacher, with the addition of convolutional neural network layers, as a discriminator that learned to classify the student’s output as natural or generated. The student learned to produce images that fooled the discriminator into classifying them as natural.
  • The third loss term encouraged similarity between the images produced by the image-generating decoder and views rendered from the 3D-generating decoder’s output.
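Putting the three signals together, a highly simplified version of the fine-tuning objective might look like the function below. The loss weights, tensor shapes, and interfaces are assumptions for illustration, not the paper's actual formulation.

```python
import torch
import torch.nn.functional as F

def finetune_loss(teacher_views: torch.Tensor,
                  rendered_views: torch.Tensor,
                  student_images: torch.Tensor,
                  disc_logits_on_student: torch.Tensor) -> torch.Tensor:
    """Illustrative combination of the three loss terms described above."""
    # 1) Distillation: views rendered from the student's 3D scene should match
    #    the views the teacher produces over a few denoising steps.
    distill = F.mse_loss(rendered_views, teacher_views)

    # 2) Adversarial: the student tries to make the discriminator (a modified
    #    copy of the teacher) label its outputs as natural (target = 1).
    adversarial = F.binary_cross_entropy_with_logits(
        disc_logits_on_student, torch.ones_like(disc_logits_on_student))

    # 3) Cross-mode consistency: the 2D decoder's images should agree with
    #    views rendered from the 3D decoder's output.
    consistency = F.mse_loss(student_images, rendered_views)

    # The weights below are placeholders, not the paper's values.
    return distill + 0.1 * adversarial + 1.0 * consistency

# Dummy tensors to show the call shape.
views = torch.randn(2, 3, 64, 64)
loss = finetune_loss(teacher_views=views,
                     rendered_views=views + 0.01,
                     student_images=views,
                     disc_logits_on_student=torch.randn(2, 1))
```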

Results: FlashWorld generated higher-quality 3D scenes at a fraction of the computational cost of previous state-of-the-art methods.

  • FlashWorld generated a 3D scene in 9 seconds running on a single Nvidia H20 GPU. By contrast, state-of-the-art image-to-3D models like Wonderland and CAT3D required 5 minutes and 77 minutes, respectively, on a more powerful A100 GPU.
  • On WorldScore, a text-to-3D benchmark that averages several metrics including how well the scene accords with the prompt and how stable an object’s lighting and color appear across different views, FlashWorld (68.72) outperformed competing models like WonderWorld (66.43) and LucidDreamer (66.32).
  • Qualitatively, its generated scenes showed finer details, such as blades of grass and animal fur, that other methods often blurred or omitted. However, FlashWorld struggled with fine-grained geometry and mirror reflections.

Why it matters: 3D generation is getting both better and faster. Combining previous approaches provides the best of both worlds. Using a pretrained diffusion model as a teacher enabled this system to learn how to produce detailed, consistent 3D representations in little time.

We’re thinking: The ability to generate 3D scenes in seconds is a big step toward generating them in real time. In gaming and virtual reality, it could shift content creation from a pre-production task to a dynamic, runtime experience.