Dense Heterogeneous Vehicle Object Detection Model Based on RT-DETR-DV

欧东源; 狐亚奇; 罗魏男; 余荣

doi:10.12052/gdutxb.260017

Abstract: In complex traffic scenarios, vehicle flow detection and vehicle tracking depend on accurate vehicle detection and localization, an area in which significant progress has been made. In dense traffic flows, however, multi-scale vehicles, overlap, and occlusion occur frequently and pose new challenges for vehicle detection. To address these problems, an improved vehicle detection model for dense traffic scenarios is proposed, termed the Dense Vehicle Real-Time Detection Transformer (RT-DETR-DV). The model builds on the Real-Time Detection Transformer (RT-DETR) framework. First, to reduce missed detections of heterogeneous vehicles, a Multi-Scale Vehicle Detection (MSVD) module is introduced to better extract and fuse features at different scales. Second, to better handle overlap and occlusion, a Dense Vehicle Feature Separation (DVFS) module is designed that separates overlapping vehicle features through a Feature Pyramid Network (FPN) branch, thereby improving feature discriminability. Finally, to strengthen the detection of small vehicles and accelerate training convergence, a dynamic loss function mechanism is proposed. Comparative experiments on the BIT-Vehicle and Venom datasets show that RT-DETR-DV has only 19.8 M parameters, 9.4% fewer than the baseline model, and that its floating point operations (FLOPs) fall to 27.9 G, a reduction of 7.7%, while the detection frame rate also improves. At the same time, the mean average precision (mAP50:95) increases by 0.6 and 1.8 percentage points on the two datasets, respectively. In addition, gradient-weighted class activation mapping (Grad-CAM) heat maps are used to verify the model’s ability to attend to target features and its detection robustness in dense traffic scenarios.
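For readers unfamiliar with the mAP50:95 metric reported above, the sketch below illustrates the matching criterion behind it. It assumes the common COCO-style convention, which the abstract does not spell out: average precision is computed at ten intersection-over-union (IoU) thresholds from 0.50 to 0.95 in steps of 0.05 and then averaged over thresholds and classes. The function names and the example boxes are illustrative only and are not taken from the paper; the full precision-recall computation is omitted.

```python
def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    # Corners of the intersection rectangle
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    # Clamp at zero so disjoint boxes give an empty intersection
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Ten IoU thresholds 0.50, 0.55, ..., 0.95 (assumed COCO-style convention)
THRESHOLDS = [round(0.50 + 0.05 * i, 2) for i in range(10)]


def thresholds_passed(pred, gt):
    """Thresholds at which `pred` would count as a match for `gt`."""
    overlap = iou(pred, gt)
    return [t for t in THRESHOLDS if overlap >= t]


if __name__ == "__main__":
    gt_box = (0, 0, 10, 10)      # hypothetical ground-truth vehicle box
    pred_box = (1, 0, 11, 10)    # hypothetical prediction shifted by 1 px
    print(iou(pred_box, gt_box))
    print(thresholds_passed(pred_box, gt_box))
```

Because a prediction must clear progressively stricter overlap thresholds, a gain in mAP50:95 reflects tighter localization as well as fewer missed vehicles, which is relevant when boxes of neighbouring vehicles overlap heavily.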

Dense Heterogeneous Vehicle Object Detection Model Based on RT-DETR-DV