[1] TAO Fei, BUSSO C. End-to-end audiovisual speech recognition system with multitask learning [J]. IEEE Transactions on Multimedia, 2020, 23: 1-11.
[2] WEI Hanbing, BAI Lin. Intelligent vehicle target detection algorithm based on multi-source heterogeneous information fusion [J]. Journal of Chongqing Jiaotong University (Natural Science), 2021, 40(8): 140-149. (in Chinese)
[3] YIN Guanghao, LIU Yuanyuan, LIU Tengfei, et al. Token-disentangling mutual transformer for multimodal emotion recognition [J]. Engineering Applications of Artificial Intelligence, 2024, 133: 108348.
[4] WU Jianqing, ZHANG Ziyi, WANG Yubo, et al. An identification method for dangerous driving behavior of heavy-duty trucks considering multimodal data [J]. Journal of Transportation Systems Engineering and Information Technology, 2024, 24(2): 63-75. (in Chinese)
[5] LU Rui, DUAN Zhiyao, ZHANG Changshui. Audio-visual deep clustering for speech separation [J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2019, 27(11): 1697-1712.
[6] LI Yangke, ZHANG Xinman. Lip landmark-based audio-visual speech enhancement with multimodal feature fusion network [J]. Neurocomputing, 2023, 549: 126432.
[7] ZHANG Lijuan, CUI Tianshu, JING Peiguang, et al. Deep multimodal feature fusion for micro-video classification [J]. Journal of Beijing University of Aeronautics and Astronautics, 2021, 47(3): 478-485. (in Chinese)
[8] LIU Shuo, QUAN Weize, WANG Chaoqun, et al. Dense modality interaction network for audio-visual event localization [J]. IEEE Transactions on Multimedia, 2022, 25: 2734-2748.
[9] LI Jiahong, LI Chenda, WU Yifei, et al. Unified cross-modal attention: Robust audio-visual speech recognition and beyond [J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2024, 32: 1941-1953.
[10] BROUSMICHE M, ROUAT J, DUPONT S. Multimodal attentive fusion network for audio-visual event recognition [J]. Information Fusion, 2022, 85: 52-59.
[11] ZHU Dandan, ZHANG Kaiwei, ZHANG Nana, et al. Unified audio-visual saliency model for omnidirectional videos with spatial audio [J]. IEEE Transactions on Multimedia, 2023, 26: 764-775.
[12] XUE Cheng, ZHONG Xionghu, CAI Minjie, et al. Audio-visual event localization by learning spatial and semantic co-attention [J]. IEEE Transactions on Multimedia, 2021, 25: 418-429.
[13] GHOSH S, SARKAR S, GHOSH S, et al. Audio-visual speech synthesis using vision transformer–enhanced autoencoders with ensemble of loss functions [J]. Applied Intelligence, 2024, 54(6): 4507-4524.
[14] WU Shaojie, SUN D J, QIU Guo. Emission analysis based on mixed traffic flow and license plate recognition model [J]. Transportation Research Part D: Transport and Environment, 2024, 134: 104331.
[15] DING Jianli, ZHANG Qiqi, WANG Jing, et al. ADS-B anomaly detection method based on Transformer-VAE [J]. Systems Engineering and Electronics, 2023, 45(11): 3680-3689. (in Chinese)
[16] WANG Xueqiu, GAO Huanbing, JIA Zemeng. Improved YOLOv8 road defect detection algorithm [J]. Computer Engineering and Applications, 2024, 60(17): 179-190. (in Chinese)