Master's thesis
Jože Fartek (Author), Aleš Holobar (Mentor)

Abstract

In this master's thesis we present the basics of speaker recognition. To this end, we first describe the computation of vocal features. We present in detail the method for computing mel-frequency cepstral coefficients (MFCC) and its advantages over other approaches. We also describe the training of voice models, along with two newer methods based on supervectors. Building on this, we developed an Android mobile application that recognizes speakers in real time, limiting recognition to only a few people. From audio recordings of individual speakers we computed MFCCs and used them to train a voice model with a convolutional neural network. To optimize the parameters, we compared how different settings affect training of the voice model. Varying the recording length between 0.5 and 3 seconds, recognition accuracy rises up to 1.5 seconds and then plateaus. Varying the number of MFCCs between 16 and 128, accuracy rises up to 48 coefficients and then plateaus. Varying the dropout rate between 0 and 0.7, accuracy improves up to a rate of 0.5 and then begins to fall. Based on these comparisons, we trained the final voice model with 1-second recordings, 32 MFCCs, and a dropout rate of 0.4, obtaining 88% model accuracy. During recognition we found that speech tempo affects recognition accuracy, whereas loudness does not. Testing was performed on an LG G7 ThinQ mobile device: computing the MFCCs took 170 milliseconds on average, and recognition with the TensorFlow Lite model only 8 milliseconds.
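The MFCC pipeline summarized above (framing, windowing, power spectrum, mel filterbank, log compression, DCT) can be sketched in plain NumPy. This is an illustrative sketch, not the thesis's implementation; the frame length, hop size, FFT size, and filter count below are common defaults assumed for the example, and the function names are hypothetical.

```python
import numpy as np

def mel_filterbank(n_filters, n_fft, sr):
    # Triangular filters spaced evenly on the mel scale from 0 to sr/2.
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):
            fb[i - 1, k] = (k - l) / max(c - l, 1)   # rising edge
        for k in range(c, r):
            fb[i - 1, k] = (r - k) / max(r - c, 1)   # falling edge
    return fb

def mfcc(signal, sr=16000, frame_len=400, hop=160, n_fft=512,
         n_filters=40, n_coeffs=32):
    # 1) Split the signal into overlapping frames and apply a Hamming window.
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)
    # 2) Power spectrum of each frame.
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # 3) Mel filterbank energies, log-compressed.
    energies = np.log(power @ mel_filterbank(n_filters, n_fft, sr).T + 1e-10)
    # 4) DCT-II decorrelates the log energies; keep the first n_coeffs.
    k = np.arange(n_filters)
    basis = np.cos(np.pi * np.outer(np.arange(n_coeffs),
                                    (2 * k + 1) / (2.0 * n_filters)))
    return energies @ basis.T   # shape: (n_frames, n_coeffs)
```

With a 1-second signal at 16 kHz and these defaults, the result is a matrix of 98 frames by 32 coefficients, which is the kind of feature map the thesis feeds to its convolutional network.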

Keywords

speaker recognition;mel-frequency cepstral coefficients;convolutional neural networks;Android platform;master's theses;

Data

Language: Slovenian
Year of publishing:
Typology: 2.09 - Master's Thesis
Organization: UM FERI - Faculty of Electrical Engineering and Computer Science
Publisher: [J. Fartek]
UDC: 004.934.8'1(043.2)
COBISS: 98851331

Other data

Secondary language: English
Secondary title: Speaker recognition on mobile devices
Secondary abstract: In this master's thesis, we review the basics of speaker recognition. We describe how audio feature extraction works, look in more detail at how mel-frequency cepstral coefficient (MFCC) extraction works, and discuss its advantages over other feature-extraction methods. This is followed by an overview of speaker models and newer methods based on supervectors. Based on this, we developed a mobile application for the Android operating system that recognizes speakers in real time, limiting recognition to only a few people. MFCCs were extracted from the audio recordings of individual speakers and used to train the speaker model with a convolutional neural network. To get better results in real-time recognition, we compared how different parameters affect training of the speaker model. Varying the recording length between 0.5 and 3 seconds, recognition performance increases up to 1.5 seconds and then plateaus. Varying the number of MFCC coefficients between 16 and 128, model performance increases up to 48 coefficients and then plateaus. We also compared the effect of the neural network dropout rate between 0 and 0.7: performance improves up to a dropout rate of 0.5 and then begins to decline. Based on these comparisons, the implemented mobile application uses one-second recordings, 32 MFCC coefficients, and a dropout rate of 0.4, achieving 88% model accuracy. We also measured how speech tempo and loudness affect recognition accuracy: speaking slower or faster decreases accuracy, while loudness has no effect. We performed testing on an LG G7 ThinQ mobile device and measured that computing the MFCC coefficients takes 170 milliseconds on average, while recognition with the TensorFlow Lite model takes only 8 milliseconds.
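The dropout comparison described above zeroes a random fraction of activations during training. A minimal NumPy sketch of inverted dropout, the variant most frameworks use, illustrates the mechanism; the function and parameter names are hypothetical, and the rescaling by 1/(1 - rate) keeps the expected activation unchanged so no scaling is needed at inference time.

```python
import numpy as np

def dropout(activations, rate, rng, training=True):
    # Inverted dropout: during training, zero a fraction `rate` of units
    # and rescale the survivors so the expected value is preserved.
    # At inference time (training=False) the input passes through unchanged.
    if not training or rate == 0.0:
        return activations
    keep = rng.random(activations.shape) >= rate
    return activations * keep / (1.0 - rate)
```

For example, with the thesis's chosen rate of 0.4, roughly 40% of the units are zeroed on each training step and the remaining ones are scaled by 1/0.6, so the layer's expected output stays close to its dropout-free value.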
Secondary keywords: Speaker recognition;Mel-frequency Cepstral Coefficients;Convolutional neural network;Android;
Type (COBISS): Master's thesis/paper
Thesis comment: University of Maribor, Faculty of Electrical Engineering and Computer Science, Computer Science and Information Technologies
Pages: 1 online resource (1 PDF file (X, 64 leaves))
ID: 14028537