Abstract:
Electroencephalogram (EEG)-based emotion recognition is an essential intelligent technique for health assessment and clinical intervention. However, EEG signals exhibit complex and complementary non-linear correlations across the spatial, temporal, and frequency domains, posing significant challenges to effective feature modeling and downstream emotion recognition performance. To address these challenges, an Emotional Spatio-Temporal-Spectral Cross-Attention Network (ESTSCA-Net) is proposed. The model adopts a dual-branch feature fusion architecture: in the spatio-temporal branch, a multi-scale 2D convolutional network sequentially processes spatio-temporal information, adaptively capturing the contextual dependencies of neural activity; in the spatio-spectral branch, a 3D bottleneck residual network with channel-wise and cross-frequency attention mechanisms selectively encodes critical spatio-spectral neural oscillations. Furthermore, a bidirectional multi-head cross-attention interaction strategy is introduced to achieve deep fusion of spatio-temporal-spectral features, yielding an effective emotion representation for classification. Experimental results on the public DEAP and MEEG datasets demonstrate that ESTSCA-Net comprehensively extracts spatio-temporal-spectral EEG features across different emotional states and consistently outperforms state-of-the-art baseline models on both the arousal and valence dimensions.
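To make the fusion strategy concrete, the following is a minimal PyTorch sketch of bidirectional multi-head cross-attention between two feature branches. All names, dimensions, and the mean-pooling fusion step are illustrative assumptions, not the paper's actual implementation; only the idea of each branch's tokens attending to the other branch's tokens is taken from the abstract.

```python
import torch
import torch.nn as nn


class BidirectionalCrossAttentionFusion(nn.Module):
    """Illustrative sketch: fuse spatio-temporal (st) and spatio-spectral (ss)
    branch features with two multi-head cross-attention passes (dims assumed)."""

    def __init__(self, dim: int = 64, heads: int = 4):
        super().__init__()
        # st tokens as queries over ss tokens, and vice versa
        self.st_to_ss = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ss_to_st = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)  # joint representation projector

    def forward(self, st: torch.Tensor, ss: torch.Tensor) -> torch.Tensor:
        # st: (batch, n_st_tokens, dim), ss: (batch, n_ss_tokens, dim)
        st_attn, _ = self.st_to_ss(st, ss, ss)  # st queries attend to ss
        ss_attn, _ = self.ss_to_st(ss, st, st)  # ss queries attend to st
        # Pool each attended sequence and concatenate before projection
        fused = torch.cat([st_attn.mean(dim=1), ss_attn.mean(dim=1)], dim=-1)
        return self.fuse(fused)  # (batch, dim) fused emotion representation


model = BidirectionalCrossAttentionFusion()
st = torch.randn(2, 10, 64)  # e.g. spatio-temporal branch tokens
ss = torch.randn(2, 5, 64)   # e.g. spatio-spectral branch tokens
out = model(st, ss)
```

The two attention modules run in opposite query/key directions, so each branch can re-weight its features by the other's context; a downstream classifier head would consume the fused vector.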