BrightDreamer: Generic 3D Gaussian Generative Framework for Fast Text-to-3D Synthesis
arXiv 2024
Lutao Jiang
AI Thrust, HKUST(GZ)
Addison Lin Wang
AI & CMA Thrust, HKUST(GZ)
Dept. of CSE, HKUST
Abstract
Text-to-3D synthesis has recently advanced by combining text-to-image models with 3D representations, e.g., Gaussian Splatting (GS), via Score Distillation Sampling (SDS). However, existing methods are inefficient: each text prompt requires its own optimization run to produce a single 3D object. A paradigm shift from per-prompt optimization to one-stage generation for arbitrary unseen text prompts is therefore needed, yet it remains challenging. The main obstacle is how to directly generate the millions of 3D Gaussians that represent a 3D object. This paper presents BrightDreamer, an end-to-end, single-stage approach that achieves generalizable and fast (77 ms) text-to-3D generation. Our key idea is to formulate the generation process as estimating the 3D deformation of an anchor shape with predefined positions. To this end, we first propose a Text-guided Shape Deformation (TSD) network that predicts the deformed shape; its new positions are used as the centers (one attribute) of the 3D Gaussians. To estimate the other four attributes (i.e., scaling, rotation, opacity, and SH coefficients), we then design a novel Text-guided Triplane Generator (TTG) that generates a triplane representation of the 3D object. The center of each Gaussian is used to query the triplane, and the resulting feature is transformed into the four attributes. The generated 3D Gaussians can then be rendered at 705 frames per second. Extensive experiments demonstrate the superiority of our method over existing methods. BrightDreamer also shows strong semantic understanding, even for complex text prompts.
Overall framework of BrightDreamer
Given a text prompt as input, we transform it into an embedding with a frozen CLIP or T5 text encoder. Next, the TSD (Text-guided Shape Deformation) network deforms the fixed anchor positions into the desired shape under text guidance. The new positions are used as the centers of the 3D Gaussians. We then design the TTG (Text-guided Triplane Generator) to separately generate three feature planes, which together form an implicit spatial representation. Using the Gaussian centers, we look up their spatial features in these planes, and the Gaussian Decoder converts the features into the remaining attributes. Finally, we render the 3D Gaussians to 2D images and use the SDS loss to optimize the whole framework.
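The data flow described above can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the text embedding and the three feature planes are random stand-ins for the outputs of the text encoder and the TTG, the TSD network is replaced by a single linear layer, the three plane features are combined by summation, and all names (make_anchor_grid, deform_anchors, sample_plane, triplane_features, decode_gaussians) and sizes are hypothetical. Rendering and the SDS loss are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_anchor_grid(n):
    """Fixed anchor positions: an n x n x n grid in [-1, 1]^3 (assumed layout)."""
    axis = np.linspace(-1.0, 1.0, n)
    x, y, z = np.meshgrid(axis, axis, axis, indexing="ij")
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)

def deform_anchors(anchors, text_emb, w):
    """Stand-in for TSD: add a small text-conditioned offset to every anchor."""
    tiled = np.broadcast_to(text_emb, (len(anchors), text_emb.size))
    inp = np.concatenate([anchors, tiled], axis=1)
    offsets = 0.1 * np.tanh(inp @ w)          # bounded offsets, (N, 3)
    return np.clip(anchors + offsets, -1.0, 1.0)

def sample_plane(plane, uv):
    """Bilinearly sample a (C, R, R) feature plane at uv coordinates in [-1, 1]."""
    _, r, _ = plane.shape
    pix = (uv + 1.0) * 0.5 * (r - 1)          # map to pixel coordinates
    i0 = np.clip(np.floor(pix).astype(int), 0, r - 2)
    frac = pix - i0
    u0, v0 = i0[:, 0], i0[:, 1]
    fu, fv = frac[:, 0], frac[:, 1]
    top = plane[:, u0, v0] * (1 - fv) + plane[:, u0, v0 + 1] * fv
    bot = plane[:, u0 + 1, v0] * (1 - fv) + plane[:, u0 + 1, v0 + 1] * fv
    return (top * (1 - fu) + bot * fu).T      # (N, C)

def triplane_features(planes, centers):
    """Project each center onto the XY, XZ, and YZ planes and sum the features
    (summation is an assumption; other aggregations are possible)."""
    xy, xz, yz = planes
    return (sample_plane(xy, centers[:, [0, 1]])
            + sample_plane(xz, centers[:, [0, 2]])
            + sample_plane(yz, centers[:, [1, 2]]))

def decode_gaussians(feats, w_dec):
    """Stand-in Gaussian Decoder: one linear layer plus per-attribute activations."""
    raw = feats @ w_dec                       # (N, 3 + 4 + 1 + 3)
    scaling = np.exp(raw[:, 0:3])             # strictly positive scales
    quat = raw[:, 3:7]
    rotation = quat / (np.linalg.norm(quat, axis=1, keepdims=True) + 1e-8)
    opacity = 1.0 / (1.0 + np.exp(-raw[:, 7:8]))   # sigmoid, in (0, 1)
    sh = raw[:, 8:11]                         # degree-0 SH (one value per RGB channel)
    return scaling, rotation, opacity, sh

text_emb = rng.normal(size=16)                # stand-in for a CLIP/T5 embedding
anchors = make_anchor_grid(8)                 # 512 anchors for the demo
w_tsd = rng.normal(size=(3 + 16, 3)) * 0.1
centers = deform_anchors(anchors, text_emb, w_tsd)

planes = [rng.normal(size=(32, 64, 64)) for _ in range(3)]  # stand-in for TTG output
feats = triplane_features(planes, centers)
w_dec = rng.normal(size=(32, 11)) * 0.1
scaling, rotation, opacity, sh = decode_gaussians(feats, w_dec)
print(centers.shape, scaling.shape, rotation.shape, opacity.shape, sh.shape)
```

The point of the sketch is the ordering: positions are predicted first, and every other attribute is a function of the feature found at that position, so the number of Gaussians is fixed by the anchor grid rather than by the decoder.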
Experiments
Interpolation demonstration of input prompts
1) Electric luxury SUV, light purple, spacious, advanced tech.
2) Electric luxury SUV, light yellow, spacious, advanced tech.
3) Electric luxury SUV, apple red, spacious, advanced tech.
4) Electric luxury SUV, deep blue, spacious, advanced tech.
5) Classic truck, silver mist, fun drive, retro appeal.
6) Racing sedan, shiny golden, clean energy propulsion, advanced safety.
7) Family minivan, light blue, large capacity, economical.
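This page does not state how the interpolation between prompts is computed. One common choice, assumed here purely for illustration, is to blend the text embeddings of two prompts linearly and feed each blended embedding to the generator. In the sketch below the embeddings are random stand-ins for encoder outputs, and lerp_embeddings is a hypothetical helper name.

```python
import numpy as np

def lerp_embeddings(emb_a, emb_b, steps):
    """Return `steps` embeddings blending linearly from emb_a to emb_b."""
    ts = np.linspace(0.0, 1.0, steps)
    return [(1.0 - t) * emb_a + t * emb_b for t in ts]

rng = np.random.default_rng(0)
emb_purple = rng.normal(size=(4, 8))   # stand-in for the encoded prompt 1)
emb_yellow = rng.normal(size=(4, 8))   # stand-in for the encoded prompt 2)

# Five embeddings: the two endpoints plus three intermediate blends.
path = lerp_embeddings(emb_purple, emb_yellow, steps=5)
print(len(path), path[0].shape)
```

Each element of path would be passed to the generator in place of a single prompt embedding, giving one 3D object per interpolation step.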
Electric luxury SUV, apple red, spacious, advanced tech
Electric luxury SUV, yellow, spacious, advanced tech
Electric luxury SUV, forest green, spacious, advanced tech
Electric luxury SUV, cyan, spacious, advanced tech
Electric luxury SUV, deep blue, spacious, advanced tech
Electric luxury SUV, light purple, spacious, advanced tech
Racing car, deep red, lightweight aero kit, sequential gearbox
Racing car, blaze orange, lightweight aero kit, sequential gearbox
Racing car, banana, lightweight aero kit, sequential gearbox
Racing car, green, lightweight aero kit, sequential gearbox
Racing car, cyan, lightweight aero kit, sequential gearbox
Racing car, purple, lightweight aero kit, sequential gearbox
Family minivan, purple, large capacity, economical
Family minivan, yellow, large capacity, economical
Family minivan, apple red, large capacity, economical
Urban microcar, orange, ideal for city life, fuel-efficient
Urban microcar, banana, ideal for city life, fuel-efficient
Urban microcar, cyan, ideal for city life, fuel-efficient
Vintage convertible, orange, chrome bumpers, white-wall tires
Vintage convertible, apple red, chrome bumpers, white-wall tires
Vintage convertible, yellow, chrome bumpers, white-wall tires
a man wearing a backpack is climbing a mountain
a woman wearing a backpack is climbing a mountain
an elderly man wearing a backpack is climbing a mountain
an elderly woman wearing a backpack is climbing a mountain
a man is trimming his plants
a fat man is trimming his plants
a fat and elderly man is trimming his plants
an elderly man is trimming his plants
a man is playing with a dog
a man wearing a backpack is playing with a dog
a man is playing with a dog on the lawn
a man is playing with a dog on the beach
a man is mowing the lawn
a man wearing a hat is mowing the lawn
a woman is mowing the lawn
a woman in a long dress is mowing the lawn
A glamorous woman in a cocktail dress is dancing in the park
A glamorous woman in a cocktail dress is dancing on the beach
A glamorous woman in a cocktail dress is dancing on the lawn
A glamorous woman in a cocktail dress is dancing at a fancy party
BibTeX
@article{jiang2024brightdreamer,
  title={Brightdreamer: Generic 3d gaussian generative framework for fast text-to-3d synthesis},
  author={Jiang, Lutao and Wang, Lin},
  journal={arXiv preprint arXiv:2403.11273},
  year={2024}
}