Abstract:
YOLO (You Only Look Once)-based algorithms have been widely used for real-time object detection and have achieved promising performance. However, further performance improvement still faces two challenges. First, standard convolutions with limited receptive fields struggle to capture global contextual features, which reduces detection accuracy on complex objects. Second, enlarging the convolution kernel can strengthen feature extraction but significantly increases the computational cost. To address these issues, this paper proposes the SCA-YOLO model, which introduces the Alterable Channel-wise Fusion module (C2fAK) and the Context-Aware Attention++ (CAA++) module to enhance detection performance. The C2fAK module combines deformable convolution with the Channel-wise Fusion (C2f) structure to strengthen feature representation while keeping the computational overhead balanced. The CAA++ module captures long-range contextual information and reduces channel redundancy, further improving detection accuracy. Experimental results show that the proposed SCA-YOLO outperforms existing methods on multiple datasets, demonstrating its effectiveness and efficiency in object detection.
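For orientation, the following is a minimal PyTorch sketch of the C2fAK idea described above: a C2f-style split-and-concatenate block whose bottleneck uses deformable convolution (torchvision.ops.DeformConv2d). All class names, channel counts, and layer choices here are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d


class DeformBottleneck(nn.Module):
    """Residual bottleneck whose 3x3 conv is replaced by a deformable conv (assumed design)."""
    def __init__(self, channels: int):
        super().__init__()
        # Offsets for a 3x3 kernel: 2 * 3 * 3 = 18 channels.
        self.offset = nn.Conv2d(channels, 18, kernel_size=3, padding=1)
        self.dconv = DeformConv2d(channels, channels, kernel_size=3, padding=1)
        self.act = nn.SiLU()

    def forward(self, x):
        return x + self.act(self.dconv(x, self.offset(x)))  # residual connection


class C2fAK(nn.Module):
    """C2f-style block: split, run n deformable bottlenecks, concatenate, fuse (sketch)."""
    def __init__(self, in_ch: int, out_ch: int, n: int = 2):
        super().__init__()
        hidden = out_ch // 2
        self.cv1 = nn.Conv2d(in_ch, 2 * hidden, kernel_size=1)
        self.blocks = nn.ModuleList(DeformBottleneck(hidden) for _ in range(n))
        self.cv2 = nn.Conv2d((2 + n) * hidden, out_ch, kernel_size=1)

    def forward(self, x):
        y = list(self.cv1(x).chunk(2, dim=1))
        for block in self.blocks:
            y.append(block(y[-1]))
        return self.cv2(torch.cat(y, dim=1))


if __name__ == "__main__":
    x = torch.randn(1, 64, 32, 32)
    print(C2fAK(64, 128)(x).shape)  # -> torch.Size([1, 128, 32, 32])
```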