Master's thesis
Alja Debeljak (Author), Marko Robnik Šikonja (Mentor), Kaja Dobrovoljc (Co-mentor)

Abstract

Paraphrasing is an important task in natural language processing: it involves generating sentences that differ from the original in form but preserve the same meaning. Automatically generating diverse and comprehensible paraphrases makes texts easier to understand and interpret and improves communication between humans and computers. In this thesis, we developed a paraphrasing model for Slovene based on pre-trained large generative language models. Because large models are computationally demanding, we chose a smaller version of the multilingual model mT5 and of the Slovene model SloT5. Both are based on the transformer architecture, which currently dominates natural language processing. From the OpenSubtitles2018 collection we obtained Slovene and English subtitles, translated the English subtitles into Slovene, and thus created a training set of aligned Slovene paraphrases. The dataset is useful for further research and for building models that generate Slovene paraphrases. We used it to fine-tune the two models, which we evaluated with the ROUGE and BERTScore metrics and qualitatively through human judgment. The SloT5 model achieved better results. By analysing the generated paraphrases, we identified the main paraphrasing strategies in Slovene and the most common errors.

Keywords

digital linguistics; natural language processing; large language models; paraphrase generation

Data

Language: Slovenian
Year of publishing:
Typology: 2.09 - Master's Thesis
Organization: UL PEF - Faculty of Education
Publisher: [A. Debeljak Šokić]
UDC: 004.4:81'322.2(043.2)
COBISS: 227906051

Other data

Secondary language: English
Secondary title: [Generating paraphrases in Slovene using machine learning]
Secondary abstract: Paraphrasing is an important task in natural language processing, involving the generation of expressions that differ in form from the original text while preserving its meaning. Automatically generating diverse and comprehensible paraphrases enhances text understanding and interpretation and improves human-computer interaction. We developed a paraphrasing model for Slovene, leveraging pre-trained models. Due to the computational complexity of large models, we selected a smaller version of the multilingual mT5 model and the Slovene SloT5 model, both of which are based on the transformer architecture, which currently prevails in the field of natural language processing. Using the OpenSubtitles2018 dataset, we obtained Slovene and English subtitles and translated the English subtitles into Slovene to create a training set with aligned Slovene paraphrases. The dataset can be used for future research and for developing models that generate Slovene paraphrases. We fine-tuned the models on this dataset and evaluated their performance with the ROUGE and BERTScore metrics, as well as with qualitative human judgment. The SloT5 model produced better results. By analyzing the generated paraphrases, we identified key paraphrasing strategies in Slovene and the most common errors.
Secondary keywords: digital linguistics; natural language processing; large language models; paraphrase generation; cognitive science; machine translation; university and higher-education theses
Type (COBISS): Master's thesis/paper
Thesis comment: University of Ljubljana, joint interdisciplinary second-cycle programme Cognitive Science, in cooperation with Universität Wien, Univerzita Komenského v Bratislave, and Eötvös Loránd Tudományegyetem
Pages: 1 online resource (1 PDF file (65 pp.))
ID: 25980359