Undergraduate thesis

Abstract

The thesis focuses on predicting end-of-sentence punctuation. Punctuation prediction models are useful for editing texts generated by speech recognition and potentially for correcting other kinds of text. We want to determine whether a sentence is declarative, interrogative or exclamatory, and where it ends. The final implementation predicts, for each word in the text, whether a punctuation mark follows it and which one. We used two Slovene variants of the BERT model, both successful in natural language processing. We fine-tuned the CroSloEngual BERT model, pretrained on Slovene, Croatian and English, and the SloBERTa model, pretrained exclusively on Slovene, on a prepared training set. The results show that SloBERTa predicts punctuation better than CroSloEngual BERT. We also found that exclamation marks are hard to predict because the training set contains too few of them.
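For illustration, a minimal sketch (not the thesis code) of how word-level punctuation prediction can be framed as token classification with the Hugging Face transformers library. The model ID "EMBEDDIA/sloberta", the four-label scheme and the example sentence are assumptions; the sketch also assumes a fast tokenizer is available, and its outputs are meaningless until the classification head is fine-tuned.

```python
# Sketch: end-of-sentence punctuation prediction as token classification.
# Assumptions: Hugging Face model ID "EMBEDDIA/sloberta", a 4-label scheme,
# and a fast tokenizer; not the thesis implementation.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

LABELS = ["O", ".", "?", "!"]        # per word: which punctuation (if any) follows it
MODEL_ID = "EMBEDDIA/sloberta"       # assumed Hugging Face ID for SloBERTa

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForTokenClassification.from_pretrained(MODEL_ID, num_labels=len(LABELS))
# Note: the classification head is randomly initialised here; it must be
# fine-tuned on labelled data before the predictions below mean anything.

words = ["kako", "si", "danes", "je", "lepo", "vreme"]
enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")

with torch.no_grad():
    logits = model(**enc).logits     # shape: (1, num_subword_tokens, num_labels)

# Read off one prediction per word, using only the first subword of each word.
pred = logits.argmax(-1)[0].tolist()
seen = set()
for idx, wid in enumerate(enc.word_ids(0)):
    if wid is None or wid in seen:
        continue                     # skip special tokens and continuation subwords
    seen.add(wid)
    print(words[wid], LABELS[pred[idx]])
```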

Keywords

deep neural networks;natural language processing;BERT model;RoBERTa model;end-of-sentence punctuation prediction;sentence segmentation;transformers;language model;computer and information science;university studies;diploma theses;

Data

Language: Slovenian
Year of publishing:
Typology: 2.11 - Undergraduate Thesis
Organization: UL FRI - Faculty of Computer and Information Science
Publisher: [N. Velikonja]
UDC: 004.8:81'322(043.2)
COBISS: 77868291
Views: 286
Downloads: 66

Other data

Secondary language: English
Secondary title: Slovene sentence segmentation and punctuation using BERT-like models
Secondary abstract: The thesis focuses on the prediction of final punctuation in sentences. Punctuation prediction models are useful in speech recognition and potentially in correcting various texts. We want to predict where sentences end and whether they end with a period, a question mark or an exclamation mark. Our implementation predicts whether and what punctuation to place after each word. We used two Slovene variants of the BERT model, both successful in natural language processing tasks. The CroSloEngual BERT model has been pretrained on Slovene, Croatian and English; we compared it to the SloBERTa model, trained exclusively on Slovene corpora. We fine-tuned both models on prepared data sets. The results show that SloBERTa predicts punctuation better than the CroSloEngual BERT model, and that predicting the exclamation mark is difficult due to the low number of training instances.
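As a complement, a hedged sketch of the label-alignment step that token-classification fine-tuning of such models typically requires: each word's label (the punctuation that follows it) is attached to its first subword, while continuation subwords and special tokens are masked with -100 so the loss ignores them. The model ID "EMBEDDIA/crosloengual-bert", the label mapping and the helper name encode_example are illustrative assumptions, not the thesis implementation.

```python
# Sketch: aligning word-level punctuation labels to subword tokens for fine-tuning.
# Assumptions: Hugging Face model ID "EMBEDDIA/crosloengual-bert" and a fast tokenizer.
from transformers import AutoTokenizer

LABEL2ID = {"O": 0, ".": 1, "?": 2, "!": 3}
tokenizer = AutoTokenizer.from_pretrained("EMBEDDIA/crosloengual-bert")  # assumed ID

def encode_example(words, word_labels):
    """Tokenize pre-split words and build a subword-level label sequence."""
    enc = tokenizer(words, is_split_into_words=True, truncation=True)
    labels, prev = [], None
    for wid in enc.word_ids():
        if wid is None:
            labels.append(-100)                    # special tokens: ignored by the loss
        elif wid != prev:
            labels.append(LABEL2ID[word_labels[wid]])  # first subword carries the label
        else:
            labels.append(-100)                    # continuation subwords: ignored
        prev = wid
    enc["labels"] = labels
    return enc

# Illustrative example: "ali prides jutri" ends with "?", "pridem" with ".".
example = encode_example(["ali", "prides", "jutri", "pridem"], ["O", "O", "?", "."])
```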
Secondary keywords: deep neural networks;natural language processing;end of sentence punctuation prediction;model RoBERTa;model BERT;sentence segmentation;transformers;language model;computer and information science;diploma;computational linguistics;punctuation;artificial intelligence;computer science;university and higher-education theses;
Type (COBISS): Bachelor thesis/paper
Study programme: 1000468
Thesis comment: University of Ljubljana, Faculty of Computer and Information Science
Pages: 32 pp.
ID: 13394698