A General Framework to Boost 3D GS Initialization for Text-to-3D Generation by Lexical Richness
ACM MM 2024

Abstract

Text-to-3D content creation has recently received much attention, especially with the prevalence of 3D Gaussian Splatting (3D GS). In general, GS-based methods comprise two key stages: initialization and rendering optimization. For initialization, existing works directly apply random sphere initialization or 3D diffusion models, e.g., Point-E, to derive the initial shapes. However, such strategies suffer from two critical yet challenging problems: 1) the final shapes remain similar to the initial ones even after training; 2) shapes can be produced only from simple texts, e.g., "a dog", but not from lexically richer (or harder) texts, e.g., "a dog is sitting on the top of the airplane". To address these problems, this paper proposes a novel general framework that boosts 3D GS initialization for text-to-3D generation according to lexical richness. Our key idea is to aggregate 3D Gaussians into spatially uniform voxels to represent complex shapes while enabling spatial interaction among the 3D Gaussians and semantic interaction between the Gaussians and texts. Specifically, we first construct a voxelized representation, where each voxel holds a 3D Gaussian with its position, scale, and rotation fixed, while opacity is the sole factor determining a position's occupancy. We then design an initialization network built mainly on two novel components: 1) a Global Information Perception (GIP) block and 2) a Gaussians-Text Fusion (GTF) block. This design enables each 3D Gaussian to assimilate spatial information from other areas and semantic information from texts. Extensive experiments show the superiority of our framework over existing methods, e.g., Shap-E, in producing high-quality 3D GS initializations on lexically simple, medium, and hard texts. Moreover, our framework can be seamlessly plugged into state-of-the-art training frameworks, e.g., LucidDreamer, for semantically consistent text-to-3D generation.
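The voxelized representation described above can be sketched roughly as follows. This is an illustrative assumption of the setup, not the authors' implementation: all function names, the grid extent, and the occupancy threshold are hypothetical; only the structural idea (fixed position/scale/rotation per voxel, opacity as the sole free, occupancy-deciding parameter) comes from the abstract.

```python
import numpy as np

def init_voxel_gaussians(resolution=32, extent=1.0):
    """Sketch: aggregate 3D Gaussians into a spatially uniform voxel grid.
    Position, scale, and rotation are fixed per voxel; only opacity is
    learnable and decides whether a position is occupied."""
    # Voxel centers on a regular grid covering [-extent, extent]^3.
    coords = np.linspace(-extent, extent, resolution)
    xs, ys, zs = np.meshgrid(coords, coords, coords, indexing="ij")
    positions = np.stack([xs, ys, zs], axis=-1).reshape(-1, 3)  # fixed

    n = positions.shape[0]
    voxel_size = 2 * extent / (resolution - 1)
    scales = np.full((n, 3), voxel_size)               # fixed, voxel-sized
    rotations = np.tile([1.0, 0.0, 0.0, 0.0], (n, 1))  # fixed identity quaternion
    opacities = np.zeros(n)                            # the sole free factor

    return positions, scales, rotations, opacities

def occupied_mask(opacities, threshold=0.5):
    # A voxel counts as occupied when its (predicted) opacity
    # exceeds a threshold; 0.5 here is an arbitrary placeholder.
    return opacities > threshold
```

Under this sketch, the initialization network would only have to predict one opacity scalar per voxel (conditioned on the text), rather than full per-Gaussian geometry.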

Results

A rabbit is cutting grass with a lawnmower.

A humanoid banana sitting at a desk doing homework.

A squirrel knight in armor jousting on a lawn.

A clockwork engineer repairing the gears of a massive steam-powered machine.

A couple cooking a complex dinner together.

A knight is setting up a campfire.

A koala wearing a party hat and blowing out candles.

A robot single-handedly lifting a basketball.

A squirrel gesturing in front of an easel showing colorful pie charts.

A stylish fox typing on a vintage typewriter.

A whale breaching the ocean surface and splashing back down.

An individual sitting on a park bench, scrolling through his smartphone.

Two bears sharing a jar of honey while sitting on a log.

Two llamas wearing bow ties and playing chess.

Two owls playing tic-tac-toe with sticks and stones.


BibTeX

@inproceedings{jiang2024general,
  title={A General Framework to Boost 3D GS Initialization for Text-to-3D Generation by Lexical Richness},
  author={Jiang, Lutao and Li, Hangyu and Wang, Lin},
  booktitle={ACM Multimedia 2024},
  year={2024}
}