A Comprehensive Investigation on Image Caption Generation using Deep Neural Networks

Main Article Content

Andreza Patrícia Batista
Lucas Hilario da Costa
Demóstenes Zegarra Rodríguez


Voice over IP (VoIP) is currently one of the most widely used communication services. However, its
quality depends on several external factors that cause various types of degradation of the voice signal,
directly affecting the quality of experience (QoE) of users. To classify the quality of a voice signal
transmitted in a VoIP communication affected by packet loss, two deep learning (DL) models were
implemented. Both models are based on a deep neural network (DNN), which analyzes voice signals
degraded by packet loss at different packet loss rates (PLR) and classifies them into four classes
according to the user's experience. Two databases were prepared, each containing four distinct classes.
One was prepared from the ITU-T P.862 recommendation database files with different packet loss rates,
and the other was prepared from the ITU-T P.501 recommendation files according to the Mean Opinion
Score (MOS) of each degraded file. The model trained on the database organized by packet loss rate
reached 94% accuracy in validation, while the model trained on the database organized by MOS reached
91% accuracy. When the results of the P.563 algorithm were compared with the classification results of
the P.862 algorithm, the P.563 algorithm reached an average accuracy of 53.21%. These results indicate
that the generated models classify the packet loss rate and the MOS index in a non-intrusive way and
with high accuracy, and that they determine the MOS of degraded voice files more effectively than the
P.563 algorithm.
Keywords: VoIP, Voice Quality, ITU-T P.862, ITU-T P.563, ITU-T P.501, Deep Learning, Machine
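
The comparison between P.563 and P.862 described in the abstract can be read as a class-agreement rate: each MOS value is mapped to one of four classes, and the accuracy is the fraction of files for which both algorithms fall in the same class. The following is a minimal Python sketch of that idea, not the authors' implementation. The class boundaries in `MOS_THRESHOLDS`, the function names, and the sample scores are all assumptions made for illustration; the abstract does not state the actual boundaries or data.

```python
from typing import Sequence

# Hypothetical MOS cut points separating the four classes.
# The abstract does not give the real boundaries.
MOS_THRESHOLDS = (2.0, 3.0, 4.0)


def mos_to_class(mos: float, thresholds: Sequence[float] = MOS_THRESHOLDS) -> int:
    """Map a MOS value to a class index from 0 (worst) to 3 (best)."""
    # Count how many thresholds the score reaches or exceeds.
    return sum(mos >= t for t in thresholds)


def class_agreement(reference: Sequence[float], estimate: Sequence[float]) -> float:
    """Fraction of files whose estimated class matches the reference class."""
    if len(reference) != len(estimate):
        raise ValueError("both score lists must have the same length")
    if not reference:
        raise ValueError("score lists must not be empty")
    matches = sum(
        mos_to_class(r) == mos_to_class(e) for r, e in zip(reference, estimate)
    )
    return matches / len(reference)


# Made-up scores for four files, for illustration only.
reference_scores = [1.5, 2.4, 3.6, 4.2]  # stands in for P.862 (intrusive) output
estimate_scores = [1.8, 3.1, 3.3, 4.1]   # stands in for P.563 (non-intrusive) output
print(class_agreement(reference_scores, estimate_scores))
```

Under this reading, the intrusive P.862 score acts as the reference label and the non-intrusive estimate is judged only on whether it lands in the same class, not on how close the two MOS values are.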

Article Details

How to Cite
Batista, A., Costa, L. H. da, & Rodríguez, D. (2022). A Comprehensive Investigation on Image Caption Generation using Deep Neural Networks. INFOCOMP Journal of Computer Science, 21(2). Retrieved from https://infocomp.dcc.ufla.br/index.php/infocomp/article/view/2482
Author Biographies

Andreza Patrícia Batista, IFMG

Master's student in the Graduate Program in System and Automation Engineering, Federal University of Lavras, Lavras-MG. Administrative Technician of the Laboratories Sector, Federal Institute of Education, Science and Technology of Minas Gerais-Campus Formiga, Formiga-MG.

Lucas Hilario da Costa, UFLA-Universidade Federal de Lavras

Master in System and Automation Engineering in the area of Intelligent Systems, Department of Computer Science at the Federal University of Lavras, Lavras-MG.

Demóstenes Zegarra Rodríguez, UFLA-Universidade Federal de Lavras

Adjunct Professor I, Department of Computer Science, Federal University of Lavras, Lavras-MG


Bergstra, J. A. and Middelburg, C. ITU-T Recommendation G.107: The E-model, a computational model for use in transmission planning. 2003.

G.107, I.-T. R. The E-model: a computational model for use in transmission planning. June 2015. Accessed: 22 April 2019.

Goodfellow, I., Bengio, Y., and Courville, A. Deep Learning. MIT Press, 2016.

Karpathy, A., Johnson, J., and Li, F. Visualizing and understanding recurrent networks. CoRR, 1506.02078, 2015.

P.563, I.-T. R. Single-ended method for objective speech quality assessment in narrow-band telephone applications. Apr. 2004.

P.800, I.-T. R. Methods for subjective determination of transmission quality. Aug. 1996.

Schmidhuber, J. Deep learning in neural networks: An overview. Neural networks, 61:85–117, 2015.

Shu, H., Song, Y., and Zhou, H. Time-frequency performance study on urban sound classification with convolutional neural network. In TENCON 2018-2018 IEEE Region 10 Conference, pages 1713–1717. IEEE, 2018.

Sinam, T., Singh, I. T., Lamabam, P., Devi, N. N., and Nandi, S. A technique for classification of VoIP flows in UDP media streams using VoIP signalling traffic. In 2014 IEEE International Advance Computing Conference (IACC), pages 354–359, Feb 2014.

Yan, W., Tang, D., and Lin, Y. A data-driven soft sensor modeling method based on deep learning and its application. IEEE Transactions on Industrial Electronics, 64(5):4237–4245, May 2017.