Generative Adversarial Networks Embedding Refinement for Speaker Diarization Improvement

Main Article Content

Vinod Pande
Dr. Vijay K. Kale
Dr. Sangramsing N. Kayte

Abstract

This research incorporates Generative Adversarial Networks (GANs) into the speaker diarization process by refining speaker embeddings and addressing the problems of overlapping speech and noise. The GANs produce better speaker embeddings through adversarial learning, making them more separable and more robust than those from traditional embedding techniques. The system was assessed on the AMI Meeting Corpus and the VoxConverse data sets, and performance was evaluated across different acoustic conditions. The results show a substantial gain: a 25% improvement in Diarization Error Rate (DER) over baseline models, which included x-vector-based clustering and end-to-end neural diarization systems. t-SNE visualization further confirmed that the cluster separability of the GAN-refined embeddings improved. The system is also suited to real-world scenarios, as it remains robust under noisy and overlapping speech conditions. This evidence indicates that GAN-based embedding refinement is an effective approach to speaker diarization.
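For readers unfamiliar with the metric, the following is a minimal sketch of how DER is conventionally computed and how a relative improvement over a baseline can be expressed. The function names and all durations are hypothetical and chosen only for illustration; they are not results from the article, and the abstract does not state whether the reported 25% is a relative or an absolute reduction.

```python
def diarization_error_rate(false_alarm, missed, confusion, total_speech):
    """Standard DER: (false alarm + missed speech + speaker confusion)
    divided by the total reference speech time (all in seconds)."""
    if total_speech <= 0:
        raise ValueError("total_speech must be positive")
    return (false_alarm + missed + confusion) / total_speech


def relative_reduction(baseline, refined):
    """Fractional reduction of `refined` relative to `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - refined) / baseline


# Hypothetical durations in seconds (assumed values, not taken from the paper).
baseline_der = diarization_error_rate(
    false_alarm=20.0, missed=30.0, confusion=50.0, total_speech=500.0
)
refined_der = diarization_error_rate(
    false_alarm=15.0, missed=25.0, confusion=35.0, total_speech=500.0
)

print(f"baseline DER: {baseline_der:.3f}")
print(f"refined DER:  {refined_der:.3f}")
print(f"relative reduction: {relative_reduction(baseline_der, refined_der):.1%}")
```

Under these assumed numbers, a drop in DER from 0.20 to 0.15 is a 25% relative reduction but only a 5-point absolute reduction, which is why the distinction matters when comparing systems.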

Article Details

How to Cite
Pande, V., Kale, V. K., & Kayte, S. N. (2025). Generative Adversarial Networks Embedding Refinement for Speaker Diarization Improvement. INFOCOMP Journal of Computer Science, 24(1). Retrieved from https://infocomp.dcc.ufla.br/index.php/infocomp/article/view/5186
Section
Machine Learning and Computational Intelligence
