Research on a Multimodal Method for Cooler Shelf Inventory Recognition

张威威; 朱宇杰; 吴博文; 杨志景; 陈添水

doi:10.12052/gdutxb.250176

Abstract

Manual stocktaking in retail inventory management is inefficient, and existing vision-based methods generalize poorly. To address these problems, this paper proposes the Cooler Shelf Inventory Recognition (CSIR) framework. CSIR takes multimodal input: a Vision Transformer encodes multi-angle images, and a linear projection aligns the resulting features with the hidden space of a LLaMA (Large Language Model Meta AI) decoder, yielding a decoder-only language model that generates inventory information end to end. A domain-specific tokenizer serializes shelf position, product type, and stock level into discrete tokens, which supports autoregressive generation. The paper also constructs a real-world dataset of 17,000 samples covering challenging conditions such as multiple viewpoints, reflections, and dense product displays, together with multidimensional evaluation metrics. In experiments, the method reaches a zero-tolerance overall accuracy of 70.17%, about 10% higher than the detection baseline, with a 5.5-fold improvement in inference efficiency. These results reduce labor costs and inventory discrepancies and offer a scalable, reproducible reference solution for automated inventory management.
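The serialization step described in the abstract can be illustrated with a minimal sketch. It is not the authors' tokenizer: the vocabulary sizes, the token layout (shelf, product, level triples between BOS and EOS), and the names `Slot`, `encode`, and `decode` are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical vocabulary sizes; the abstract does not state the real ones.
NUM_SHELVES = 6     # shelf positions 0..5
NUM_PRODUCTS = 50   # product type ids 0..49
NUM_LEVELS = 11     # stock levels 0..10

# Each field gets its own contiguous id range, so a token id alone
# identifies both the field and its value.
BOS, EOS = 0, 1
SHELF_BASE = 2
PRODUCT_BASE = SHELF_BASE + NUM_SHELVES
LEVEL_BASE = PRODUCT_BASE + NUM_PRODUCTS
VOCAB_SIZE = LEVEL_BASE + NUM_LEVELS


@dataclass(frozen=True)
class Slot:
    """One inventory record: where, what, and how much."""
    shelf: int
    product: int
    level: int


def _check(value: int, size: int, name: str) -> int:
    if not 0 <= value < size:
        raise ValueError(f"{name} {value} out of range [0, {size})")
    return value


def encode(slots: List[Slot]) -> List[int]:
    """Serialize records into a flat token sequence: BOS, triples, EOS."""
    tokens = [BOS]
    for s in slots:
        tokens.append(SHELF_BASE + _check(s.shelf, NUM_SHELVES, "shelf"))
        tokens.append(PRODUCT_BASE + _check(s.product, NUM_PRODUCTS, "product"))
        tokens.append(LEVEL_BASE + _check(s.level, NUM_LEVELS, "level"))
    tokens.append(EOS)
    return tokens


def decode(tokens: List[int]) -> List[Slot]:
    """Invert encode(); rejects sequences that are not well-formed triples."""
    if len(tokens) < 2 or tokens[0] != BOS or tokens[-1] != EOS:
        raise ValueError("sequence must start with BOS and end with EOS")
    body = tokens[1:-1]
    if len(body) % 3 != 0:
        raise ValueError("body must consist of (shelf, product, level) triples")
    slots = []
    for i in range(0, len(body), 3):
        slots.append(Slot(
            shelf=_check(body[i] - SHELF_BASE, NUM_SHELVES, "shelf"),
            product=_check(body[i + 1] - PRODUCT_BASE, NUM_PRODUCTS, "product"),
            level=_check(body[i + 2] - LEVEL_BASE, NUM_LEVELS, "level"),
        ))
    return slots
```

In a setup like this, a decoder would be trained to emit the token sequence one id at a time, and `decode` would turn its output back into structured records. Keeping the three fields in disjoint id ranges makes malformed generations easy to detect, since an id in the wrong position fails the range check.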
