A Hybrid-Precision Convolutional Neural Network Accelerator Based on Block Floating Point

Abstract: With the continuous development of convolutional neural network (CNN) models in deep learning, computational complexity and hardware resource consumption have become major bottlenecks limiting computational efficiency. This paper proposes a hybrid-precision convolution processing element (HPE) based on block floating point (BFP). The HPE optimizes the design of the convolution computation unit in the hardware architecture by replacing traditional look-up tables (LUTs) with digital signal processor (DSP) blocks and by employing a data packing technique. The design switches flexibly between INT4 and BFP8 computation modes, significantly improving computational performance while reducing hardware resource consumption. Experimental results show that, in the hybrid-precision (INT4 and BFP8) mode, the HPE substantially reduces LUT and flip-flop (FF) overhead relative to the baseline design, improving hardware resource efficiency by 123.40% and 58.16%, respectively. In addition, the data packing technique gives the HPE a 2× speedup over conventional implementations. This work provides an efficient hardware solution for deep learning acceleration with broad application potential, particularly in deep learning tasks that demand high computational efficiency and resource optimization.
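
To make the number format concrete, the sketch below shows one common form of block floating point: every value in a block shares a single exponent and keeps a small signed-integer mantissa, so a block dot product reduces to integer multiply-accumulates plus one exponent addition per block. This is a minimal illustrative sketch under assumed conventions (function names, an 8-bit mantissa for BFP8, power-of-two scaling), not the paper's exact quantizer.

    import numpy as np

    def bfp_quantize(block, mant_bits=8):
        # One shared exponent per block, chosen so the largest magnitude
        # fits in a signed mantissa of mant_bits bits.
        max_abs = float(np.max(np.abs(block)))
        if max_abs == 0.0:
            return np.zeros(block.shape, dtype=np.int32), 0
        shared_exp = int(np.floor(np.log2(max_abs))) - (mant_bits - 2)
        lo, hi = -(1 << (mant_bits - 1)), (1 << (mant_bits - 1)) - 1
        mant = np.clip(np.round(block / 2.0 ** shared_exp), lo, hi).astype(np.int32)
        return mant, shared_exp

    def bfp_dot(mant_x, exp_x, mant_w, exp_w):
        # A BFP dot product is an integer multiply-accumulate on the
        # mantissas plus a single exponent addition, which is why one
        # integer datapath can serve both INT4 and BFP8 modes.
        acc = int(np.dot(mant_x.astype(np.int64), mant_w.astype(np.int64)))
        return acc * 2.0 ** (exp_x + exp_w)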

       
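The data packing idea can likewise be sketched in software. On FPGA DSP slices (for example, the 27×18-bit multiplier in a DSP48E2), two narrow products that share one operand can be computed by a single wide multiplication if the two INT4 weights are placed in non-overlapping bit fields. The paper's exact field layout is not given here, so the 16-bit field and helper name below are illustrative assumptions.

    def packed_int4_mul(w0, w1, a, field=16):
        # Pack two signed INT4 weights into one wide operand and multiply
        # once by a shared INT4 activation; unpacking then recovers both
        # products, doubling throughput per multiplier.
        packed = (w1 << field) + w0           # exact two's-complement packing
        product = packed * a                  # one wide multiply, two products
        low = product & ((1 << field) - 1)    # lower field: w0 * a
        if low >= 1 << (field - 1):           # sign-extend the lower product
            low -= 1 << field
        high = (product - low) >> field       # upper field: w1 * a, borrow-corrected
        return low, high

    # Both packed results match the direct INT4 multiplications.
    assert packed_int4_mul(-5, 7, 6) == (-30, 42)
    assert packed_int4_mul(3, -8, -7) == (-21, 56)

The same trick generalizes to packing more products per multiplier, as long as the partial products are narrow enough that their bit fields never overlap.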
