Research on a Multimodal Method for Cooler Shelf Inventory Recognition
Abstract
Manual inventory counting in retail is inefficient, and existing visual methods generalize poorly. To address both problems, this paper proposes the Cooler Shelf Inventory Recognition (CSIR) framework. The framework takes multimodal inputs, encodes multi-angle images with a Vision Transformer, and aligns the resulting features to the latent space of the LLaMA (Large Language Model Meta AI) decoder through a linear projection. A decoder-only language model then generates inventory information end to end. To support autoregressive generation, a domain-specific tokenizer serializes shelf positions, product types, and inventory levels into discrete tokens. The paper also constructs a real-scenario dataset of 17,000 samples covering complex conditions such as multiple views, reflective surfaces, and dense arrangements, and defines multidimensional evaluation metrics. Experimental results show that the proposed method achieves a tolerance-free overall accuracy of 70.17%, an improvement of approximately 10% over the detection baseline, along with a 5.5-fold increase in inference efficiency. These gains can reduce labor costs and inventory discrepancies, and the framework provides a scalable and reproducible reference solution for automated inventory management.
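The serialization idea behind the domain-specific tokenizer can be illustrated with a minimal Python sketch. This is not the paper's implementation: the vocabulary sizes, the token layout (one shelf token, one product token, one count token per slot), and the names `Slot`, `encode`, `decode`, and `accuracy` are all hypothetical. The `accuracy` helper also assumes that "tolerance-free" means a predicted count must match the ground truth exactly (tolerance of zero).

```python
from dataclasses import dataclass

# Hypothetical vocabulary sizes; the abstract does not specify them.
NUM_SHELVES = 8
NUM_PRODUCTS = 50
MAX_COUNT = 30

# Assumed token layout: special tokens first, then one contiguous
# ID range per field so each token type is unambiguous.
BOS, EOS = 0, 1
SHELF_BASE = 2
PRODUCT_BASE = SHELF_BASE + NUM_SHELVES
COUNT_BASE = PRODUCT_BASE + NUM_PRODUCTS
VOCAB_SIZE = COUNT_BASE + MAX_COUNT + 1


@dataclass(frozen=True)
class Slot:
    shelf: int    # shelf position index
    product: int  # product type index
    count: int    # inventory level


def encode(slots):
    """Serialize slots into a flat list of discrete token IDs."""
    ids = [BOS]
    for s in slots:
        in_range = (0 <= s.shelf < NUM_SHELVES
                    and 0 <= s.product < NUM_PRODUCTS
                    and 0 <= s.count <= MAX_COUNT)
        if not in_range:
            raise ValueError(f"slot out of vocabulary range: {s}")
        ids += [SHELF_BASE + s.shelf,
                PRODUCT_BASE + s.product,
                COUNT_BASE + s.count]
    ids.append(EOS)
    return ids


def decode(ids):
    """Invert encode(): rebuild slots from a token ID sequence."""
    body = [t for t in ids if t not in (BOS, EOS)]
    if len(body) % 3 != 0:
        raise ValueError("token sequence is not a whole number of slots")
    return [Slot(body[i] - SHELF_BASE,
                 body[i + 1] - PRODUCT_BASE,
                 body[i + 2] - COUNT_BASE)
            for i in range(0, len(body), 3)]


def accuracy(pred, gold, tolerance=0):
    """Fraction of gold slots whose predicted count is within tolerance.

    A gold slot with no matching prediction counts as an error.
    """
    if not gold:
        raise ValueError("gold must contain at least one slot")
    pred_counts = {(p.shelf, p.product): p.count for p in pred}
    hits = 0
    for g in gold:
        predicted = pred_counts.get((g.shelf, g.product))
        if predicted is not None and abs(predicted - g.count) <= tolerance:
            hits += 1
    return hits / len(gold)
```

In an autoregressive setup, a sequence such as the one returned by `encode` would serve as the decoder's target, and `decode` would turn generated token IDs back into structured inventory records for scoring.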