Chinese Core Journal
CSCD Source Journal
China Science and Technology Core Journal
RCCSE Chinese Core Academic Journal

Journal of Chongqing Jiaotong University (Natural Science) ›› 2026, Vol. 45 ›› Issue (1): 95-94. DOI: 10.3969/j.issn.1674-0696.2026.01.12

• Traffic & Transportation + Artificial Intelligence •

Audio-Visual Fusion Detection Method of Traffic Volume Based on Cross-Modal Multi-head Attention Mechanism

MA Qinglu1, WU Feifei1, WU Yuechuan1, ZHANG Li1, ZHANG Geng2   

  1. School of Traffic & Transportation, Chongqing Jiaotong University, Chongqing 400074, China; 2. School of Computer and Information Science, Southwest University, Chongqing 400715, China
  • Received: 2025-01-24; Revised: 2025-05-19; Published: 2026-01-15

  • About the author: MA Qinglu (1980—), male, from Weinan, Shaanxi; professor, Ph.D.; main research interests: intelligent transportation and traffic safety. E-mail: qlm@cqjtu.edu.cn
  • Funding:
    National Natural Science Foundation of China (52072054); Chongqing Natural Science Foundation General Project (CSTB2023NSCQ-MSX0551); Graduate Research and Innovation Funding Project of Chongqing Jiaotong University (CYS240483)

Abstract: To address the problem that neither visual nor audio signals alone can fully capture the time-domain and frequency-domain details required for traffic volume detection, an audio-visual fusion detection method based on cross-modal multi-head attention was proposed. A detection model spanning the audio and video modalities was established to obtain high-quality visual and acoustic representations of traffic and to fuse them efficiently. Firstly, the Res2Net and DCNv3 networks were employed to extract features from the audio and video data, and a bi-directional long short-term memory (BiLSTM) network processed the temporal features, so that the complex behavior sequences in each modality were analyzed to yield a rich and coherent description of traffic information. Secondly, in the cross-modal fusion stage, cross-attention was combined with multi-head attention: attention was computed in multiple subspaces and the subspace outputs were concatenated to perform multi-head cross-modal fusion. Finally, cross-entropy loss and consistency loss were applied jointly to strengthen the coordinated analysis of information from the different modalities, ensuring consistent performance of the multi-modal data in classification and recognition tasks. Experimental results show that, in the traffic volume detection scenario, the proposed method improves average vehicle detection accuracy over single audio, single video, and the AVSS (audio-visual speech separation) fusion method by 2.57%, 1.70%, and 0.95%, respectively; average vehicle classification accuracy by 4.72%, 1.78%, and 1.62%; and average detection accuracy of overall traffic volume by 4.41%, 2.96%, and 1.46%. Performance also remains stable across four distinct scenarios.
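The multi-head cross-modal fusion step described above can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: random matrices stand in for learned projection weights, the feature dimensions and sequence lengths are made up, and the function name `multi_head_cross_attention` is hypothetical. It shows the mechanism the abstract names: one modality supplies the queries, the other supplies the keys and values, attention is computed in several subspaces (heads) in parallel, and the subspace outputs are concatenated.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_cross_attention(query_feats, kv_feats, num_heads, rng):
    """Fuse one modality (queries) with another (keys/values) via
    multi-head cross-attention. Random weights stand in for learned
    projections; shapes are (time, dim)."""
    t_q, d = query_feats.shape
    t_kv, _ = kv_feats.shape
    assert d % num_heads == 0
    d_h = d // num_heads
    w_q = rng.standard_normal((d, d)) / np.sqrt(d)
    w_k = rng.standard_normal((d, d)) / np.sqrt(d)
    w_v = rng.standard_normal((d, d)) / np.sqrt(d)
    # Project, then split the feature dimension into `num_heads` subspaces.
    q = (query_feats @ w_q).reshape(t_q, num_heads, d_h).transpose(1, 0, 2)
    k = (kv_feats @ w_k).reshape(t_kv, num_heads, d_h).transpose(1, 0, 2)
    v = (kv_feats @ w_v).reshape(t_kv, num_heads, d_h).transpose(1, 0, 2)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_h)  # (heads, t_q, t_kv)
    attn = softmax(scores, axis=-1)
    out = attn @ v                                    # (heads, t_q, d_h)
    # Concatenate the heads back into one feature vector per time step.
    return out.transpose(1, 0, 2).reshape(t_q, d)

rng = np.random.default_rng(0)
video_feats = rng.standard_normal((10, 64))  # e.g. 10 video time steps
audio_feats = rng.standard_normal((25, 64))  # e.g. 25 audio frames
fused = multi_head_cross_attention(video_feats, audio_feats, num_heads=4, rng=rng)
print(fused.shape)  # one fused vector per video time step: (10, 64)
```

Note that the audio and video streams may have different lengths; cross-attention handles this naturally because the output length follows the query sequence, while the keys and values come from the other modality.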
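The joint objective (cross-entropy plus consistency loss) can also be sketched. The abstract does not specify the form of the consistency term, so the mean-squared distance between the per-modality class distributions below is an assumption for illustration, as are the function name `joint_loss` and the weighting factor `lam`.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def joint_loss(fused_logits, audio_logits, video_logits, labels, lam=0.5):
    """Cross-entropy on the fused prediction plus a consistency term
    (here: MSE between the audio and video class distributions) that
    encourages the two modalities to agree."""
    n = fused_logits.shape[0]
    p_fused = softmax(fused_logits)
    # Standard cross-entropy over the true-class probabilities.
    ce = -np.log(p_fused[np.arange(n), labels] + 1e-12).mean()
    p_audio = softmax(audio_logits)
    p_video = softmax(video_logits)
    consistency = np.mean((p_audio - p_video) ** 2)
    return ce + lam * consistency

rng = np.random.default_rng(1)
labels = np.array([0, 2, 1])               # 3 samples, 4 vehicle classes
loss = joint_loss(rng.standard_normal((3, 4)),
                  rng.standard_normal((3, 4)),
                  rng.standard_normal((3, 4)),
                  labels)
print(float(loss))
```

The consistency term only compares predicted distributions, so it needs no extra labels; it acts as a regularizer keeping the audio and video branches aligned during training.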

Key words: traffic engineering; traffic volume detection; audio-visual fusion; cross-modal; attention mechanism


