Use of vector embedding for intelligent processing of Slovenian text (Uporaba vektorske vgradnje za inteligentno obdelavo slovenskega besedila)

Master's thesis

Urban Strnišnik (Author), Sašo Karakatič (Mentor)

Abstract

In this master's thesis, we first addressed the problem of extracting useful knowledge from unstructured text. According to IDC reports, the ratio between structured and unstructured data grows every year. There are several ways to extract useful knowledge from unstructured text; one of them is word embedding, also called vector embedding. We began with a review of word embedding techniques: what they are and what can be achieved with them. We found that the term word embedding refers to assigning a vector value to a word, on which further computational operations can then be performed. The purpose of the thesis was to test several vector embedding algorithms, build our own text processing models, and compare them with some existing models. We then tested our own and the existing text processing models and, based on the comparison, identified their advantages and disadvantages for use in a particular environment. When training the models, we considered both supervised and unsupervised learning techniques. The input corpus was obtained from the regulations of fourteen Slovenian universities and faculties. We then analysed and discussed the results, which gave us answers to the research questions and allowed us to accept or reject the hypotheses.

Keywords

word embeddings;machine learning;natural language processing;text classification;supervised learning;unsupervised learning;master's theses;

Data

Language:	Slovenian
Year of publishing:	2020
Typology:	2.09 - Master's Thesis
Organization:	UM FERI - Faculty of Electrical Engineering and Computer Science
Publisher:	[U. Strnišnik]
UDC:	004.85:81'4(043.2)
COBISS:	38002947

Other data

Secondary language:	English
Secondary title:	Use of vector embedding for intelligent processing of Slovene text
Secondary abstract:	In this master's thesis, we first focused on the problem of acquiring useful knowledge from unstructured text. According to IDC reports, the ratio between structured and unstructured data is increasing every year. There are several ways of acquiring useful knowledge from unstructured text, one of which is word embedding, or vector embedding. We first reviewed word embedding techniques: what they are and what we can achieve with them. We found that the term word embedding stands for assigning a vector value to a word, with which we can perform further computational operations. The purpose of the master's thesis was to test some vector embedding algorithms, create our own text processing models, and then compare them with some existing models. We then tested our own and the existing text processing models and, based on the comparison, identified the advantages and disadvantages of using them in a particular environment. As part of model training, we focused on both supervised and unsupervised learning techniques. The input data corpus was obtained from the regulations of fourteen Slovenian universities and faculties. We then analysed and discussed the results, which answered the research questions and led us to accept or reject the hypotheses.
Secondary keywords:	word embedding;machine learning;fastText;natural language processing;doc2vec;word2vec;text classification;supervised learning;unsupervised learning;
Type (COBISS):	Master's thesis/paper
Thesis comment:	University of Maribor, Faculty of Electrical Engineering and Computer Science, Informatics and Technologies of Communication
Pages:	VII, 66 f.
ID:	11933110