Automatic text-to-3D synthesis has achieved remarkable advances through the optimization of 3D models. Existing methods commonly rely on pre-trained text-to-image generative models, such as diffusion models, which provide scores for 2D renderings of Neural Radiance Fields (NeRFs) that are then used to optimize the NeRFs. However, these methods often produce artifacts and inconsistencies across multiple views due to their limited understanding of 3D geometry. To address these limitations, we propose a reformulation of the optimization loss using the diffusion prior. Furthermore, we introduce a novel training approach that unlocks the potential of the diffusion prior. To improve the 3D geometry representation, we apply auxiliary depth supervision to NeRF-rendered images and regularize the density field of NeRFs. Extensive experiments demonstrate the superiority of our method over prior works, achieving advanced photo-realism and improved multi-view consistency.
Update: New & Better Renderings!
We fixed the Moiré patterns (flickering/ripples in the rendered videos). It turns out that our z-variance loss produces a very sharp surface boundary (which is neat), but during NeRF's coarse-to-fine upsampling, the fine samples were concentrated just inside the object surface instead of spanning across it. Our solution is to apply a rectangular filter [1, 1, 1] to the coarse samples' CDF (cumulative distribution function) before performing the upsampling.
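The sketch below illustrates this fix, assuming a PyTorch NeRF pipeline where importance sampling inverts a per-ray CDF over the coarse bins; the function name, tensor shapes, and the 1/3 normalization are our illustrative assumptions, not necessarily the exact code:

```python
import torch
import torch.nn.functional as F

def box_filter_cdf(cdf: torch.Tensor) -> torch.Tensor:
    """Blur each ray's coarse CDF with a rectangular [1, 1, 1] filter so
    fine samples straddle the surface instead of clustering just inside it.

    cdf: (num_rays, num_bins + 1), monotonically non-decreasing per ray.
    """
    padded = F.pad(cdf.unsqueeze(1), (1, 1), mode="replicate")
    kernel = torch.full((1, 1, 3), 1.0 / 3.0, device=cdf.device, dtype=cdf.dtype)
    smoothed = F.conv1d(padded, kernel).squeeze(1)
    # Averaging a monotone sequence with a positive kernel keeps it monotone,
    # so the result is still a valid CDF for inverse-transform sampling.
    return smoothed
```

The smoothed CDF then replaces the raw coarse CDF in the usual inverse-CDF fine-sampling routine (e.g., `sample_pdf` in common NeRF codebases).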
*(Updated rendering videos; images omitted.)* Prompts:

- A baby bunny sitting on top of a stack of pancakes
- A stack of pancakes covered in maple syrup
- A ladybug
- A snail on a leaf
- Head of Thanos
- A wooden buddha head
- A peacock with a long neck
- A parrot
- Iron Throne from Game of Thrones
- Small saguaro cactus planted in a clay pot
*(Gallery of multi-view rendered results; images omitted.)* Prompts:

- A baby bunny sitting on top of a stack of pancakes
- Small saguaro cactus planted in a clay pot
- A beautifully carved wooden knight chess piece
- A beautiful dress made out of garbage bags, on a mannequin. Studio lighting, high quality, high resolution
- An ice cream sundae
- A car made out of sushi
- Michelangelo style colorful statue of an astronaut
- A high pile of chocolate chip cookies
- Head of Thanos
- A highly detailed stone bust of Theodoros Kolokotronis
- A ladybug
- A ripe strawberry
- A ceramic lion
- Iron Throne from Game of Thrones
Our method replaces the Score Distillation Sampling (SDS) loss with an image reconstruction loss between a rendered image and a target image, where the target is obtained by adding noise to the rendering and denoising it with Stable Diffusion. In certain cases, this loss yields exactly the same gradients as SDS. However, we can enhance it by computing the reconstruction error in RGB space, adding extra denoising steps, and even swapping the noise scheduler. Moreover, we find that this loss achieves better results when we anneal the strength of the added noise, using a diffusion timestep that decreases with the square root of the current training iteration.

Furthermore, we propose several auxiliary losses to improve the quality of the generated 3D models: a scale-invariant depth reconstruction loss against a pre-trained monocular depth prediction model, and a z-variance loss. The z-variance loss is particularly valuable, as it eliminates artifacts (e.g., cloudy geometry) seen in previous radiance-field-based methods (e.g., DreamFusion, Score Jacobian Chaining) and allows us to obtain impressive results with just a NeRF model (i.e., one-stage 3D modeling), skipping the mesh fine-tuning stage (two-stage 3D modeling). This freedom enables us to generate photorealistic results with a simpler implementation while avoiding limitations of mesh-based methods, including the assumption that a surface exists, limited mesh resolution, and a fixed class of shaders.
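As a minimal PyTorch sketch of the pieces above, assuming a standard volume-rendering setup; the schedule bounds, tensor shapes, and function names are illustrative assumptions rather than our exact implementation:

```python
import torch

def annealed_timestep(step: int, total_steps: int,
                      t_min: float = 0.02, t_max: float = 0.98) -> float:
    # Noise strength decays with the square root of training progress:
    # large noise early (coarse layout), small noise late (fine detail).
    # The bounds t_min/t_max are assumptions for illustration.
    return t_max - (t_max - t_min) * (step / total_steps) ** 0.5

def rgb_reconstruction_loss(rendered: torch.Tensor,
                            denoised: torch.Tensor) -> torch.Tensor:
    # rendered: the NeRF rendering; denoised: the image recovered by
    # adding noise to `rendered` and denoising with Stable Diffusion.
    # Gradients flow only through the rendering, as in SDS.
    return ((rendered - denoised.detach()) ** 2).mean()

def z_variance_loss(weights: torch.Tensor, z_vals: torch.Tensor,
                    eps: float = 1e-6) -> torch.Tensor:
    # weights: (rays, samples) volume-rendering weights along each ray;
    # z_vals:  (rays, samples) sample depths. Penalizing the weighted
    # variance of depth concentrates density around a single surface,
    # suppressing the cloudy geometry mentioned above.
    w = weights / (weights.sum(-1, keepdim=True) + eps)
    mean_z = (w * z_vals).sum(-1, keepdim=True)
    return (w * (z_vals - mean_z) ** 2).sum(-1).mean()
```

In training, the annealed timestep selects how much noise to add before the Stable Diffusion denoising pass, and the two losses are summed (with weights) together with the scale-invariant depth term.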
*(More rendered results; images omitted.)* Prompts:

- A wooden buddha head with many faces next to each other
- A beautiful peacock
- A pyre made from bone, no fire
- Sorting Hat from Harry Potter
- A kingfisher bird
- A blue poison-dart frog sitting on a water lily
- A stack of pancakes covered in maple syrup
- Game asset of a leather shoe made from dragon skin