    Single-image 3D Generation via Depth-consistent Supervision and CLIP-based Semantic Alignment


      Abstract: Single-image 3D generation has broad application potential in digital asset creation, virtual reality, and metaverse content production. However, existing optimization-based methods built on Score Distillation Sampling (SDS) often suffer from geometric ambiguity and multi-view texture inconsistency due to the lack of explicit geometric constraints and cross-view semantic supervision. To address these issues, this paper proposes a two-stage single-image 3D generation method based on 3D Gaussian Splatting (3DGS). In the geometry-constrained generation stage, the method combines the SDS framework with multi-view geometric supervision: a pre-trained multi-view diffusion model synthesizes novel-view images, and a monocular depth estimation network provides pseudo-depth constraints, explicitly improving shape reconstruction accuracy and cross-view consistency. In the texture refinement stage, the optimized Gaussian representation is converted into an explicit mesh, and the texture is refined through multi-step denoising with a pre-trained diffusion model to recover high-frequency details and local structures. In addition, a semantic consistency constraint based on the Contrastive Language-Image Pre-training (CLIP) model keeps the appearance of corresponding semantic regions coherent across views, further improving texture fidelity and visual coherence. Experimental results show that the proposed method significantly improves the accuracy of 3D geometric structure and the quality of texture detail while remaining efficient, validating its effectiveness for single-image 3D generation.
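      The two auxiliary objectives the abstract describes can be illustrated with a minimal numpy sketch. Both loss formulations below are assumptions for illustration only, not the paper's exact definitions: the pseudo-depth term uses a scale-and-shift-invariant alignment (a common choice when supervising with relative monocular depth), and the semantic term penalizes the cosine distance between each rendered view's CLIP image embedding and the reference view's embedding. The embeddings themselves would come from a pre-trained CLIP image encoder; here they are plain vectors.

```python
import numpy as np

def depth_consistency_loss(pred_depth, pseudo_depth):
    """Scale-and-shift-invariant depth loss (assumed formulation).

    Monocular depth networks predict depth only up to an unknown scale
    and shift, so we first solve a least-squares alignment of the
    rendered depth to the pseudo-depth, then take the mean squared error.
    """
    x = pred_depth.ravel()
    y = pseudo_depth.ravel()
    # Fit y ≈ s * x + t in the least-squares sense.
    A = np.stack([x, np.ones_like(x)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, y, rcond=None)
    aligned = s * x + t
    return float(np.mean((aligned - y) ** 2))

def clip_semantic_consistency_loss(view_embeddings, ref_embedding):
    """Cross-view semantic consistency loss (assumed formulation).

    Each novel view's CLIP image embedding is compared against the
    reference (input) view's embedding; 1 - cosine similarity is
    averaged over views, encouraging coherent appearance across views.
    """
    ref = ref_embedding / np.linalg.norm(ref_embedding)
    losses = []
    for e in view_embeddings:
        e = e / np.linalg.norm(e)
        losses.append(1.0 - float(np.dot(e, ref)))
    return float(np.mean(losses))
```

      In a full pipeline these two terms would be weighted and added to the SDS objective during the corresponding stage; the weights and the choice of reference view are further design decisions not specified by the abstract.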

       
