Labeling document clusters using word embeddings

diploma thesis

Nikola Đukić (Author), Blaž Zupan (Mentor)

Abstract

Documents can be represented as vectors in various ways and visualized in two-dimensional space. In this space we can find clusters of similar documents and then find words that describe each cluster well. The visualization of the documents can be enriched by displaying the words found. Methods for labeling document clusters are used for this purpose; they rely on importance measures that consider only word frequencies in the given corpus. In this thesis we propose a new method for labeling document clusters that uses pre-trained word embedding models to embed both documents and words and rests on the assumption that similar words are represented by similar vectors. We compare word embedding models in terms of their mutual similarity and their performance on classification tasks in order to choose the one to use in combination with the cluster labeling method. We evaluate the method empirically and compare it with an existing approach, showing that, because it uses pre-trained models, it can also work successfully on very small datasets, which the existing approach cannot.

Keywords

word embeddings;visualization;clustering;computer and information science;university study programme;diploma theses;

Data

Language:	Slovenian
Year of publication:	2020
Typology:	2.11 - Diploma thesis
Organization:	UL FRI - Faculty of Computer and Information Science
Publisher:	[N. Đukić]
UDC:	004(043.2)
COBISS:	31040003
Views:	581
Downloads:	105

Other data

Secondary language:	English
Secondary title:	Labeling document clusters using word embeddings
Secondary abstract:	Documents can be represented as vectors in various ways and visualized in two-dimensional space. In that space, we can find clusters of similar documents and the words that describe each cluster as well as possible. Those words can be added to the visualization to enrich it. This can be achieved by using methods for labeling document clusters. These methods use the frequencies of words in a given corpus to measure the importance of each word. In this thesis we propose a novel method for labeling clusters of documents. The method uses pre-trained word embedding models to embed both words and documents and relies on the assumption that similar words are represented by similar vectors. We compare word embedding models by computing their mutual similarities and their scores on classification tasks in order to choose the one to use in combination with our method. The method is empirically evaluated and compared with the traditional approach. We show that, unlike the traditional approach, our method can work on very small datasets because it uses pre-trained models to obtain the embeddings.
Secondary keywords:	word embeddings;visualization;clustering;computer and information science;diploma thesis;
Type of work (COBISS):	Diploma thesis
Study programme:	1000468
Embargo end (OpenAIRE):	1970-01-01
Comment on the material:	Univ. of Ljubljana, Fac. of Computer and Information Science
Pages:	45 pp.
ID:	12033206