Abstract:
With the continuous development of convolutional neural networks (CNNs) in deep learning, computational complexity and hardware resource consumption have become significant bottlenecks limiting computational efficiency. This paper proposes a hybrid processing element (HPE) based on Block Floating Point (BFP), which optimizes the design of the convolution computation unit in the hardware architecture by replacing traditional Look-up Table (LUT) logic with DSP blocks and by employing data packing techniques. This design enables flexible switching between INT4 and BFP8 computation modes, significantly improving computational performance and reducing hardware resource consumption. Experimental results show that, in the hybrid-precision (INT4 and BFP8) computation mode, the HPE significantly reduces LUT and FF overhead, improving hardware resource utilization efficiency by 123.40% and 58.16%, respectively, compared to the baseline. Furthermore, the data packing techniques enable the HPE to achieve 2× higher throughput than conventional implementations. This study provides an efficient hardware solution for deep learning acceleration, with broad potential applications, especially in deep learning tasks requiring high computational efficiency and resource optimization.
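To make the Block Floating Point idea concrete, the following is a minimal software sketch (not the paper's hardware implementation) of BFP8-style quantization: a block of values shares a single exponent taken from the largest magnitude in the block, and each value keeps only a narrow signed mantissa. The function name, the 8-bit mantissa width, and the rounding scheme are illustrative assumptions, not details from the paper.

```python
import math

def bfp_quantize(values, mantissa_bits=8):
    """Quantize a block of floats to shared-exponent block floating point.

    Returns (mantissas, shared_exp, dequantized) where each value is
    approximated as mantissa * 2**(shared_exp - (mantissa_bits - 1)).
    """
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return [0] * len(values), 0, [0.0] * len(values)
    # Shared exponent from the largest magnitude: max_abs = m * 2**e, 0.5 <= m < 1.
    shared_exp = math.frexp(max_abs)[1]
    # Scale so the largest value fits in a signed (mantissa_bits)-bit integer.
    scale = 2.0 ** (shared_exp - (mantissa_bits - 1))
    qmax = 2 ** (mantissa_bits - 1) - 1
    mantissas = [max(-qmax - 1, min(qmax, round(v / scale))) for v in values]
    dequantized = [m * scale for m in mantissas]
    return mantissas, shared_exp, dequantized

# Within a block, arithmetic on the mantissas reduces to integer operations,
# which is why BFP maps well onto the same datapath as INT4/INT8 MACs.
mants, exp, deq = bfp_quantize([1.0, 0.5, -0.25])
```

Because the whole block shares one exponent, the per-value multipliers become plain integer multipliers, which is the property that lets a single DSP-based datapath serve both integer and BFP modes.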