Abstract:
Single-image 3D generation has broad application potential in digital asset creation, virtual reality, and metaverse content production. However, existing optimization methods based on Score Distillation Sampling (SDS) often suffer from geometric ambiguity and multi-view texture inconsistency, owing to the lack of explicit geometric constraints and semantic supervision. To address these issues, this paper proposes a two-stage single-image 3D generation method based on 3D Gaussian Splatting (3DGS). In the geometry-constrained generation stage, the method combines the SDS framework with multi-view geometric supervision: a pre-trained multi-view diffusion model synthesizes novel-view images, and a monocular depth estimation network provides pseudo-depth constraints, explicitly improving shape reconstruction accuracy and cross-view consistency. In the texture refinement stage, the optimized Gaussian representation is converted into an explicit mesh, whose texture is iteratively refined with a pre-trained diffusion model to recover high-frequency details and local structures. In addition, a semantic consistency constraint based on the Contrastive Language-Image Pre-training (CLIP) model ensures that corresponding semantic regions maintain a coherent appearance across views, further improving texture fidelity and visual coherence. Experimental results show that the proposed method significantly improves both geometric accuracy and texture detail quality while remaining computationally efficient, validating its effectiveness for single-image 3D generation.