Audio-Visual Fusion Detection Method of Traffic Volume Based on Cross-Modal Multi-head Attention Mechanism
MA Qinglu1, WU Feifei1, WU Yuechuan1, ZHANG Li1, ZHANG Geng2
2026, 45(1): 95-94.
DOI: 10.3969/j.issn.1674-0696.2026.01.12
Aiming at the problem that a single visual or audio signal cannot fully capture the detailed time-domain and frequency-domain information in traffic volume detection, an audio-visual fusion detection method for traffic volume based on a cross-modal multi-head attention mechanism was proposed. A cross-modal traffic volume detection model spanning both the audio and video modalities was established to obtain high-quality visual and acoustic modal representations of traffic and to fuse them efficiently. Firstly, the Res2Net and DCNv3 networks were employed to extract features from the audio and video data, respectively, while a bi-directional long short-term memory (BiLSTM) network processed the time-series features; the complex behavior sequences in the audio and video streams were analyzed to obtain a rich and coherent description of the traffic information. Secondly, in the cross-modal fusion stage, cross-attention was integrated with multi-head attention, and the outputs of multiple attention subspaces were combined to perform multi-head cross-modal fusion. Finally, the joint application of a cross-entropy loss and a consistency loss strengthened the coordinated analysis of information from the different modalities, ensuring consistent performance of the multi-modal data in classification and recognition tasks. Experimental results demonstrate that, in the traffic volume detection scenario, the proposed method improves the average vehicle detection accuracy over the single-audio, single-video, and AVSS (audio-visual speech separation) fusion baselines by 2.57%, 1.70%, and 0.95%, respectively; the average vehicle classification accuracy by 4.72%, 1.78%, and 1.62%, respectively; and the average detection accuracy of the overall traffic volume by 4.41%, 2.96%, and 1.46%, respectively. Moreover, the performance remains stable across four distinct scenarios.
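The fusion step described above — cross-attention combined with multi-head attention, where multiple subspaces are attended separately and their outputs combined — can be sketched in plain Python. This is a minimal illustration of the generic mechanism only, not the authors' implementation: all function names, the toy feature dimensions, and the choice of video features as queries against audio keys/values are assumptions for illustration.

```python
import math

def matmul(A, B):
    # Naive matrix product for small illustrative matrices.
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(row):
    # Numerically stable softmax over one row of attention scores.
    m = max(row)
    e = [math.exp(x - m) for x in row]
    s = sum(e)
    return [x / s for x in e]

def cross_attention(Q, K, V):
    # Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V,
    # with queries from one modality and keys/values from the other.
    d = len(Q[0])
    K_T = [list(c) for c in zip(*K)]
    scores = matmul(Q, K_T)
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, V)

def multi_head_cross_attention(audio, video, num_heads):
    # Hypothetical fusion direction: video frames act as queries,
    # audio frames supply keys and values. Each head attends over a
    # d/num_heads slice of the features; head outputs are concatenated.
    d = len(audio[0])
    assert d % num_heads == 0, "feature dim must divide evenly into heads"
    hd = d // num_heads
    heads = []
    for h in range(num_heads):
        sl = slice(h * hd, (h + 1) * hd)
        Q = [row[sl] for row in video]
        K = [row[sl] for row in audio]
        V = [row[sl] for row in audio]
        heads.append(cross_attention(Q, K, V))
    # Combine the subspace outputs along the feature dimension.
    return [sum((heads[h][i] for h in range(num_heads)), []) for i in range(len(video))]
```

In a real model each head would also apply learned query/key/value projections before attention and an output projection after concatenation; they are omitted here to keep the subspace-splitting logic visible.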