A Comprehensive Investigation on Image Caption Generation using Deep Neural Networks

Andreza Batista; Lucas Hilario da Costa; Demóstenes Rodríguez

Published: Dec 19, 2022

Andreza Patrícia Batista

IFMG

Lucas Hilario da Costa

UFLA-Universidade Federal de Lavras

Demóstenes Zegarra Rodríguez

UFLA-Universidade Federal de Lavras

Abstract

Voice over IP (VoIP) is currently one of the most widely used communication services. Its quality,
however, depends on several external factors that degrade the voice signal in different ways and
directly affect the users' quality of experience (QoE). To classify the quality of a voice signal
transmitted over a VoIP connection affected by packet loss, two deep learning (DL) models were
implemented, both based on a deep neural network (DNN). The models analyze voice signals degraded
at different packet loss rates (PLR) and classify them into four classes according to the user's
experience. Two databases were prepared, each containing four distinct classes. The first was built
from the ITU-T P.862 recommendation database files, degraded at different packet loss rates. The
second was built from the ITU-T P.501 recommendation files, labelled according to the Mean Opinion
Score (MOS) of each degraded file. The model trained on the PLR-based database reached 94% accuracy
in validation, and the model trained on the MOS-based database reached 91% accuracy. When the
classification results of the P.563 algorithm were compared with those of the P.862 algorithm, P.563
reached an average accuracy of 53.21%. These results indicate that the generated models can classify
the packet loss rate and the MOS index non-intrusively and with high accuracy, and that they
determine the MOS of degraded voice files more effectively than the P.563 algorithm.
Keywords: VoIP, Voice Quality, ITU-T P.862, ITU-T P.563, ITU-T P.501, Deep Learning, Machine
Learning
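
As a rough illustration of the PLR-based labelling described in the abstract, the sketch below zeroes out randomly chosen packets of a signal and maps a loss rate to one of four classes. It is only a minimal sketch under stated assumptions: the class boundaries in `PLR_CLASS_EDGES`, the packet size, and the function names `drop_packets` and `plr_class` are hypothetical and do not come from the paper, which does not state its thresholds or its loss model here.

```python
import random

# Hypothetical class boundaries in percent (assumption, not from the paper).
# Three edges give four classes, indexed 0 (least loss) to 3 (most loss).
PLR_CLASS_EDGES = [5.0, 10.0, 20.0]


def drop_packets(samples, packet_size, plr_percent, seed=0):
    """Zero out whole packets at random to mimic packet loss.

    Returns the degraded signal and the realized loss rate in percent.
    """
    rng = random.Random(seed)
    degraded = list(samples)
    n_packets = (len(samples) + packet_size - 1) // packet_size
    lost = 0
    for p in range(n_packets):
        # Each packet is lost independently with probability plr_percent / 100.
        if rng.random() < plr_percent / 100.0:
            lost += 1
            start = p * packet_size
            end = min(start + packet_size, len(samples))
            for i in range(start, end):
                degraded[i] = 0.0
    realized = 100.0 * lost / n_packets if n_packets else 0.0
    return degraded, realized


def plr_class(plr_percent, edges=PLR_CLASS_EDGES):
    """Map a loss rate in percent to a class index from 0 to len(edges)."""
    for idx, edge in enumerate(edges):
        if plr_percent < edge:
            return idx
    return len(edges)


if __name__ == "__main__":
    signal = [1.0] * 1600  # placeholder samples, not real speech
    degraded, realized = drop_packets(signal, packet_size=160, plr_percent=10.0)
    print(f"realized PLR: {realized:.1f}%, class: {plr_class(realized)}")
```

A classifier such as the DNN mentioned in the abstract would then be trained on features of the degraded signals, with these class indices as targets.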

How to Cite

Batista, A., Costa, L. H. da, & Rodríguez, D. (2022). A Comprehensive Investigation on Image Caption Generation using Deep Neural Networks. INFOCOMP Journal of Computer Science, 21(2). Retrieved from https://infocomp.dcc.ufla.br/index.php/infocomp/article/view/2482

Issue

Vol. 21 No. 2 (2022): December 2022

Section

Articles

Upon receipt of accepted manuscripts, authors will be invited to complete a copyright license to publish the paper. At least the corresponding author must send the signed copyright form for publication. It is a condition of publication that authors grant an exclusive license to the INFOCOMP Journal of Computer Science. This ensures that requests from third parties to reproduce articles are handled efficiently and consistently, and it also allows the article to be disseminated as widely as possible. After assigning the copyright license, authors may use their own material in other publications, provided that the INFOCOMP Journal of Computer Science is acknowledged as the original place of publication.

Author Biographies

Andreza Patrícia Batista, IFMG

Master's student in the Graduate Program in System and Automation Engineering, Federal University of Lavras, Lavras-MG. Administrative Technician of the Laboratories Sector, Federal Institute of Education, Science and Technology of Minas Gerais-Campus Formiga, Formiga-MG.

Lucas Hilario da Costa, UFLA-Universidade Federal de Lavras

Master's degree in System and Automation Engineering in the area of Intelligent Systems, Department of Computer Science at the Federal University of Lavras, Lavras-MG

Demóstenes Zegarra Rodríguez, UFLA-Universidade Federal de Lavras

Adjunct Professor I, Department of Computer Science, Federal University of Lavras, Lavras-MG
