Undergraduate thesis
Blaž Bulić (Author), Marko Robnik Šikonja (Mentor)

Abstract

In this thesis, we developed a procedure for discovering new word senses. We extracted the list of observed words from a word-sense disambiguation dataset. Sentences containing an observed word were obtained from the news database of the Event Registry service. We represented the words as vectors using the multilingual-BERT-Base, Cased and SloBERTa models and clustered them in several ways. We compared the results with the data from the disambiguation dataset and manually checked several words with known semantic shifts. The results are not promising. We believe the main reason is an unsuitable text collection.
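The clustering step described in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: the abstract only says the vectors were clustered in several ways, so the k-means-style procedure under cosine distance, the farthest-point initialization, the function names, and the three-dimensional toy vectors (stand-ins for contextual embeddings of one target word) are all assumptions made for illustration.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def farthest_point_init(vectors, k):
    """Deterministic start: take the first vector, then repeatedly add the
    vector whose nearest already-chosen centroid is farthest away."""
    centroids = [vectors[0]]
    while len(centroids) < k:
        candidate = max(
            vectors,
            key=lambda v: min(cosine_distance(v, c) for c in centroids),
        )
        centroids.append(candidate)
    return centroids


def cluster_occurrences(vectors, k, iterations=20):
    """Hypothetical helper: k-means-style clustering under cosine distance.
    Each resulting cluster is read as one candidate sense of the word."""
    centroids = farthest_point_init(vectors, k)
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each occurrence joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: cosine_distance(v, centroids[c]))
            for v in vectors
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                centroids[c] = mean_vector(members)
    return labels


# Toy stand-ins for the embeddings of five occurrences of one word.
# Occurrences 0, 1 and 4 point one way; occurrences 2 and 3 another.
occurrence_vectors = [
    [1.0, 0.1, 0.0],
    [0.9, 0.2, 0.1],
    [0.1, 0.1, 1.0],
    [0.0, 0.2, 0.9],
    [0.95, 0.0, 0.05],
]

labels = cluster_occurrences(occurrence_vectors, k=2)
print(labels)
```

In a real setting the toy vectors would be replaced by the model's representation of the target word in each sentence, and the number of clusters would have to be chosen or estimated rather than fixed at two.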

Keywords

word meanings;word vector embeddings;clustering;BERT model;natural language processing;word sense induction;interdisciplinary studies;university studies;undergraduate theses;

Data

Language: Slovenian
Year of publishing:
Typology: 2.11 - Undergraduate Thesis
Organization: UL FRI - Faculty of Computer and Information Science
Publisher: [B. Bulić]
UDC: 004.8:81'322(043.2)
COBISS: 168959747

Other data

Secondary language: English
Secondary title: Word sense induction in Slovene using large language models
Secondary abstract: In the thesis, we developed a procedure for discovering new word meanings. We extracted the list of observed words from the word-sense disambiguation dataset. Sentences containing the observed word were obtained from the news database from the Event Registry service. We represented the words with vectors using the models multilingual-BERT-Base, Cased and SloBERTa and clustered them in various ways. We compared the results with the data from the disambiguation dataset and manually checked some words with known semantic shifts. The obtained results are not promising. We believe that the main reason is an unsuitable text database.
Secondary keywords: meanings of words;word vector embeddings;clustering;BERT;natural language processing;word sense induction;computer science;computer and information science;computer science and mathematics;interdisciplinary studies;diploma;computational linguistics;computer science;university and higher-education theses;
Type (COBISS): Bachelor thesis/paper
Study programme: 1000407
Embargo end date (OpenAIRE): 1970-01-01
Thesis comment: University of Ljubljana, Faculty of Computer and Information Science
Pages: 37
ID: 19937509