Acoustic scene classification based on joint optimization of NMF and CNN

doi:10.12305/j.issn.1001-506X.2022.05.01

Abstract:

To address the problem of representing complex acoustic environments in acoustic scene classification, an optimization algorithm that jointly trains the feature extractor and the classification model is proposed. Non-negative matrix factorization (NMF) is combined with the training of a convolutional neural network (CNN), and the network loss is used to update the feature extraction and the network parameters together, so that more discriminative, supervised features are learned. Log spectrograms extracted from the TUT2017 dataset serve as the base features, and deep convolutional neural networks are built for experimental verification. Simulation results show that the recognition accuracy of the proposed algorithm is 3.9% higher than before optimization and exceeds that of two other commonly used acoustic features, which demonstrates that the algorithm effectively improves overall classification performance.

Key words: feature learning, non-negative matrix factorization, convolutional neural network, joint optimization
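The abstract builds on non-negative matrix factorization, which approximates a non-negative time-frequency matrix V as the product of a dictionary W and activations H. As a minimal sketch of that building block only, the code below uses the standard unsupervised multiplicative-update rule for the Frobenius objective. It is not the paper's joint update, in which the network loss would also drive the feature extraction. The function names, the toy matrix, and all parameter values are assumptions made for illustration.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(a):
    """Return the transpose of a list-of-rows matrix."""
    return [list(col) for col in zip(*a)]


def frobenius_error(v, w, h):
    """Frobenius norm of V - WH, the quantity the updates below reduce."""
    wh = matmul(w, h)
    return sum((x - y) ** 2 for rv, rw in zip(v, wh) for x, y in zip(rv, rw)) ** 0.5


def nmf(v, k, n_iter=200, seed=0, eps=1e-9):
    """Factorise non-negative V (F x T) as W (F x k) times H (k x T)."""
    rng = random.Random(seed)
    f, t = len(v), len(v[0])
    # Strictly positive random initialisation keeps every denominator positive.
    w = [[rng.random() + 0.1 for _ in range(k)] for _ in range(f)]
    h = [[rng.random() + 0.1 for _ in range(t)] for _ in range(k)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num = matmul(wt, v)
        den = matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(t)]
             for i in range(k)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num = matmul(v, ht)
        den = matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)]
             for i in range(f)]
    return w, h


# Toy 4x3 "spectrogram" (hypothetical values) that two non-negative bases can explain.
V = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [1.0, 1.0, 1.0], [0.0, 1.0, 2.0]]
W, H = nmf(V, k=2)
print("reconstruction error:", frobenius_error(V, W, H))
```

In a supervised variant such as the one the abstract describes, the activations H (or a projection onto W) would presumably be fed to the classifier, and W would receive an additional update signal from the classification loss. That step is deliberately left out here because the page does not give its exact form.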

Juan WEI, Huangwei YANG, Fangli NING. Acoustic scene classification based on joint optimization of NMF and CNN[J]. Systems Engineering and Electronics, 2022, 44(5): 1433-1438.

Figures and tables (5): Figure 1, Table 1, Table 2, Table 3, Table 4

Name	CNN8	CNN10	CNN12
Input layer	256×108×1	256×108×1	256×108×1
BN, convolution	BN, 3×3@64	BN, 3×3@64	BN, 3×3@64
BN, activation, convolution	BN, ReLU, 3×3@64	BN, ReLU, 3×3@64	BN, ReLU, 3×3@64
Pooling	4×2 AvgPooling	4×2 AvgPooling	4×2 AvgPooling
BN, activation, convolution	(BN, ReLU, 3×3@128) × 2	(BN, ReLU, 3×3@128) × 2	(BN, ReLU, 3×3@128) × 2
Pooling	4×2 AvgPooling	4×2 AvgPooling	4×2 AvgPooling
BN, activation, convolution	(BN, ReLU, 3×3@256) × 2	(BN, ReLU, 3×3@256) × 2	(BN, ReLU, 3×3@256) × 2
Pooling	—	2×1 AvgPooling	2×1 AvgPooling
BN, activation, convolution	—	(BN, ReLU, 3×3@512) × 2	(BN, ReLU, 3×3@512) × 2
Pooling	—	—	2×1 AvgPooling
BN, activation, convolution	—	—	(BN, ReLU, 3×3@1024) × 2
BN, activation, convolution	BN, ReLU, 1×1@1024	BN, ReLU, 1×1@1024	BN, ReLU, 1×1@1024
BN, convolution, global pooling	BN, 1×1@15, Global AvgPooling	BN, 1×1@15, Global AvgPooling	BN, 1×1@15, Global AvgPooling
Fully connected, output	Dense(15), Softmax	Dense(15), Softmax	Dense(15), Softmax
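The layer table above can be sanity-checked with a short sketch that traces the feature-map size through the pooling stages and counts the convolution layers of each model. It rests on assumptions the page does not state: 3×3 convolutions keep the spatial size ("same" padding), pooling is non-overlapping with stride equal to its window, the first pooling factor applies to the 256-bin axis, and the last three rows of the table are shared by all three models. The names `trace_shapes`, `count_convs`, `POOLS`, and `PAIR_BLOCKS` are hypothetical.

```python
def trace_shapes(input_hw, pools):
    """Return the feature-map size before pooling and after each pooling stage.

    Assumes size-preserving convolutions and non-overlapping average pooling.
    """
    h, w = input_hw
    shapes = [(h, w)]
    for ph, pw in pools:
        h, w = h // ph, w // pw
        shapes.append((h, w))
    return shapes


def count_convs(n_pair_blocks):
    """Two stem convolutions, two per '× 2' block, two 1×1 head convolutions."""
    return 2 + 2 * n_pair_blocks + 2


# Pooling windows read off the table, top to bottom (assumed axis order: 256-axis, 108-axis).
POOLS = {
    "CNN8": [(4, 2), (4, 2)],
    "CNN10": [(4, 2), (4, 2), (2, 1)],
    "CNN12": [(4, 2), (4, 2), (2, 1), (2, 1)],
}
# Number of "(BN, ReLU, 3×3@C) × 2" blocks per model in the table.
PAIR_BLOCKS = {"CNN8": 2, "CNN10": 3, "CNN12": 4}

for name in POOLS:
    shapes = trace_shapes((256, 108), POOLS[name])
    print(name, "convs:", count_convs(PAIR_BLOCKS[name]), "shapes:", shapes)
```

Under these assumptions the convolution counts come out as 8, 10, and 12, which matches the model names and supports reading the final rows as common to all three networks.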

SNMF	Fold1	Fold2	Fold3	Fold4	Average
K=64	0.781	0.795	0.771	0.824	0.793
K=128	0.805	0.837	0.793	0.854	0.822
K=256	0.827	0.839	0.814	0.863	0.836
K=512	0.818	0.831	0.807	0.855	0.828
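The average column of the table above can be re-derived from the four fold accuracies. The sketch below does that and picks the dictionary size K with the highest mean; the variable names are hypothetical and the numbers are copied from the table.

```python
from statistics import mean

# Per-fold accuracy for each dictionary size K, copied from the table above.
folds_by_k = {
    64: [0.781, 0.795, 0.771, 0.824],
    128: [0.805, 0.837, 0.793, 0.854],
    256: [0.827, 0.839, 0.814, 0.863],
    512: [0.818, 0.831, 0.807, 0.855],
}

# Mean over the four folds, rounded to the three decimals used in the table.
averages = {k: round(mean(acc), 3) for k, acc in folds_by_k.items()}
best_k = max(averages, key=averages.get)

for k, avg in averages.items():
    print(f"K={k}: {avg:.3f}")
print("best K:", best_k)
```

The recomputed means agree with the table, and accuracy peaks at K=256 before dropping slightly at K=512.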

Model	Fold1	Fold2	Fold3	Fold4	Average
CNN8	0.808	0.807	0.778	0.806	0.800
CNN10	0.827	0.839	0.814	0.863	0.836
CNN12	0.811	0.815	0.788	0.861	0.819

Scene	Baseline system	NMF	TNMF	SNMF	CQT	LM
Beach	0.753	0.751	0.747	0.835	0.895	0.887
Bus	0.718	0.893	0.813	0.928	0.930	0.922
Cafe/restaurant	0.577	0.618	0.544	0.793	0.611	0.628
Car	0.971	0.962	0.945	0.942	0.978	0.941
City center	0.907	0.943	0.867	0.893	0.778	0.920
Forest path	0.795	0.769	0.892	0.925	0.881	0.855
Grocery store	0.587	0.801	0.828	0.920	0.883	0.929
Home	0.686	0.702	0.662	0.792	0.820	0.663
Library	0.571	0.725	0.691	0.658	0.783	0.685
Metro station	0.917	0.742	0.826	0.815	0.852	0.747
Office	0.998	0.965	0.950	0.941	0.875	0.942
Park	0.702	0.695	0.712	0.705	0.545	0.723
Residential area	0.641	0.874	0.774	0.738	0.691	0.764
Train	0.580	0.657	0.768	0.802	0.685	0.712
Tram	0.817	0.852	0.847	0.851	0.864	0.876
Overall	0.748	0.797	0.791	0.836	0.805	0.813
Prediction time/s	-	2.6	1.1	2.7	3.1	3.3
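The overall row of the table above allows a quick check of the headline figure. The sketch below computes the gain of SNMF over each other column, both in absolute percentage points and as a relative change. Reading the abstract's 3.9% as the absolute SNMF-minus-NMF difference is an inference from these numbers, not something the page states; the function and variable names are hypothetical.

```python
# Overall accuracy per method, copied from the last accuracy row of the table above.
overall = {
    "baseline": 0.748,
    "NMF": 0.797,
    "TNMF": 0.791,
    "SNMF": 0.836,
    "CQT": 0.805,
    "LM": 0.813,
}


def gain_points(method, reference, table=overall):
    """Absolute gain of `method` over `reference`, in percentage points."""
    return round((table[method] - table[reference]) * 100, 1)


def gain_relative(method, reference, table=overall):
    """Relative gain of `method` over `reference`, in percent."""
    return round((table[method] / table[reference] - 1) * 100, 1)


for ref in overall:
    if ref != "SNMF":
        print(f"SNMF vs {ref}: +{gain_points('SNMF', ref)} points, "
              f"+{gain_relative('SNMF', ref)}% relative")
```

The SNMF-versus-NMF difference comes out at 3.9 percentage points, which is consistent with the improvement quoted in the abstract if that figure is an absolute difference.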
