Generative Adversarial Networks Embedding Refinement for Speaker Diarization Improvement

Vinod Pande; Dr. Vijay K. Kale; Dr. Sangramsing N. Kayte


Published: Aug 6, 2025

Vinod Pande

Dr. G. Y. Pathrikar College of Computer Science and Information Technology, MGM University

Dr. Vijay K. Kale

Dr. G. Y. Pathrikar College of Computer Science and Information Technology, MGM University

Dr. Sangramsing N. Kayte

University in Copenhagen, Denmark

Abstract

This research incorporates Generative Adversarial Networks (GANs) into the speaker diarization pipeline to refine speaker embeddings and to address the problems of overlapping speech and noise. Through adversarial learning, the GAN produces speaker embeddings that are more separable and more robust than those from traditional embedding techniques. The system was evaluated on the AMI Meeting Corpus and the VoxConverse dataset across a range of acoustic conditions. The results show a 25% improvement in Diarization Error Rate (DER) over baseline models, which included x-vector-based clustering and end-to-end neural diarization systems. t-SNE visualization further confirmed that the GAN-refined embeddings form more separable clusters. The system also remains robust under noisy and overlapping speech conditions, which makes it suitable for real-world scenarios. These findings indicate that GAN-based embedding refinement is an effective approach to improving speaker diarization.
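For readers unfamiliar with the metric, DER is conventionally defined as the sum of false-alarm, missed-speech, and speaker-confusion time divided by the total reference speech time. The following is a minimal sketch of that standard definition and of how a relative reduction could be computed. The function names and all durations are hypothetical and chosen only for illustration; they are not results from the paper, and the abstract does not state whether the reported 25% is a relative or an absolute reduction.

```python
def diarization_error_rate(false_alarm, missed, confusion, total_speech):
    """Standard DER: (false alarm + missed speech + speaker confusion) / total reference speech.

    All arguments are durations in seconds.
    """
    if total_speech <= 0:
        raise ValueError("total_speech must be positive")
    return (false_alarm + missed + confusion) / total_speech


def relative_reduction(baseline_der, refined_der):
    """Fraction by which the refined DER is lower than the baseline DER."""
    if baseline_der <= 0:
        raise ValueError("baseline_der must be positive")
    return (baseline_der - refined_der) / baseline_der


# Hypothetical durations (seconds); NOT taken from the paper.
baseline = diarization_error_rate(
    false_alarm=30.0, missed=50.0, confusion=120.0, total_speech=1000.0
)
refined = diarization_error_rate(
    false_alarm=30.0, missed=50.0, confusion=70.0, total_speech=1000.0
)

print(f"baseline DER = {baseline:.3f}")
print(f"refined DER  = {refined:.3f}")
print(f"relative reduction = {relative_reduction(baseline, refined):.1%}")
```

In this illustration only the speaker-confusion term changes, since that is the component most directly affected by how separable the embeddings are; false alarms and missed speech depend mainly on speech activity detection.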

How to Cite

Pande, V., Kale, V. K., & Kayte, S. N. (2025). Generative Adversarial Networks Embedding Refinement for Speaker Diarization Improvement. INFOCOMP Journal of Computer Science, 24(1). Retrieved from https://infocomp.dcc.ufla.br/index.php/infocomp/article/view/5186

Issue

Vol. 24 No. 1 (2025): June 2025

Section

Machine Learning and Computational Intelligence

Upon receipt of accepted manuscripts, authors will be invited to complete a copyright license to publish the paper. At minimum, the corresponding author must send the signed copyright form for publication. It is a condition of publication that authors grant an exclusive license to the INFOCOMP Journal of Computer Science. This ensures that requests from third parties to reproduce articles are handled efficiently and consistently, and it allows the article to be disseminated as widely as possible. After assigning the copyright license, authors may use their own material in other publications, provided that the INFOCOMP Journal of Computer Science is acknowledged as the original place of publication.

