Generative Adversarial Networks Embedding Refinement for Speaker Diarization Improvement
Abstract
This research incorporates Generative Adversarial Networks (GANs) into the speaker diarization pipeline to refine speaker embeddings and to address the problems of overlapping speech and noise. Through adversarial learning, the GAN produces speaker embeddings that are more separable and more discriminative than those obtained with traditional embedding techniques. The system was evaluated on the AMI Meeting Corpus and the VoxConverse dataset, with performance measured across different acoustic conditions. The results show a 25% improvement in Diarization Error Rate (DER) over baseline models, which included x-vector-based clustering and end-to-end neural diarization systems. t-SNE visualization further confirmed that the GAN-refined embeddings form more clearly separated clusters. The system also remains robust under noisy and overlapping speech conditions, which makes it suitable for real-world scenarios. These findings indicate that GAN-based embedding refinement is an effective approach to speaker diarization.
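For readers unfamiliar with the metric, the sketch below shows how DER is conventionally computed and how a relative improvement between two systems could be expressed. It assumes the standard definition of DER (false alarm, missed speech, and speaker confusion time divided by total reference speech time) and assumes the reported 25% is a relative reduction, which the abstract does not state. The function names and all durations are hypothetical and are not results from the paper.

```python
def diarization_error_rate(false_alarm, missed, confusion, total_speech):
    """Standard DER: summed error time divided by total reference speech time.

    All arguments are durations in seconds.
    """
    if total_speech <= 0:
        raise ValueError("total_speech must be positive")
    return (false_alarm + missed + confusion) / total_speech


def relative_reduction(baseline_der, refined_der):
    """Fraction by which refined_der is lower than baseline_der."""
    if baseline_der <= 0:
        raise ValueError("baseline_der must be positive")
    return (baseline_der - refined_der) / baseline_der


# Hypothetical durations in seconds, chosen only to illustrate the arithmetic.
baseline = diarization_error_rate(12.0, 30.0, 18.0, 600.0)
refined = diarization_error_rate(9.0, 22.5, 13.5, 600.0)

print(f"baseline DER: {baseline:.1%}")
print(f"refined DER:  {refined:.1%}")
print(f"relative DER reduction: {relative_reduction(baseline, refined):.0%}")
```

Under this reading, a baseline DER of 10.0% falling to 7.5% is a 25% relative reduction. If the figure were instead an absolute reduction, the comparison would be a simple difference of the two DER values.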
Article Details
Upon receipt of accepted manuscripts, authors will be invited to complete a copyright license to publish the paper. At least the corresponding author must send the signed copyright form for publication. It is a condition of publication that authors grant an exclusive licence to the INFOCOMP Journal of Computer Science. This ensures that requests from third parties to reproduce articles are handled efficiently and consistently, and it also allows the article to be disseminated as widely as possible. After assigning the copyright license, authors may use their own material in other publications, provided that the INFOCOMP Journal of Computer Science is acknowledged as the original place of publication.