Undergraduate thesis
Abstract
The thesis focuses on predicting final punctuation in sentences. Punctuation prediction models are useful for editing texts produced by speech recognition and potentially for correcting other texts as well. We want to determine whether a sentence is declarative, interrogative or exclamatory, and where it ends. The final implementation predicts, for each word in a text, whether a punctuation mark follows it and which one. We used two Slovene variants of the BERT model, both successful in natural language processing. We fine-tuned the CroSloEngual BERT model, pretrained on Slovene, Croatian and English, and the SloBERTa model, pretrained exclusively on Slovene, on a prepared training set. The results show that SloBERTa predicts punctuation better than CroSloEngual BERT. We also found that exclamation marks are hard to predict because there are too few of them in the training set.
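A minimal sketch of the token-level approach described above, assuming the Hugging Face transformers library and the public model identifiers EMBEDDIA/sloberta and EMBEDDIA/crosloengual-bert; the label set and the example sentence are illustrative, and the actual fine-tuning loop on the prepared training set is omitted.

```python
# Sketch only: token classification for end-of-sentence punctuation prediction.
# Model IDs and labels are assumptions, not the thesis code.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Label of the punctuation mark that follows each word ("O" = no final punctuation).
labels = ["O", "PERIOD", "QUESTION", "EXCLAMATION"]

model_name = "EMBEDDIA/sloberta"  # or "EMBEDDIA/crosloengual-bert"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name, num_labels=len(labels))
# The classification head above is freshly initialised; fine-tuning it on the
# prepared training set (e.g. with transformers.Trainer) is omitted here.

words = ["kje", "si", "bil", "danes", "zunaj", "je", "lep", "dan"]
enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")

with torch.no_grad():
    logits = model(**enc).logits          # shape: (1, sequence_length, num_labels)
pred = logits.argmax(dim=-1)[0].tolist()

# Map subword-level predictions back to words (take the first subword of each word).
seen = set()
for idx, word_id in enumerate(enc.word_ids()):
    if word_id is None or word_id in seen:
        continue
    seen.add(word_id)
    print(words[word_id], labels[pred[idx]])
```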
Keywords
deep neural networks;natural language processing;BERT model;RoBERTa model;end-of-sentence punctuation prediction;sentence segmentation;transformers;language model;computer and information science;university studies;undergraduate theses;
Data
Language: Slovenian
Year of publishing: 2021
Typology: 2.11 - Undergraduate Thesis
Organization: UL FRI - Faculty of Computer and Information Science
Publisher: [N. Velikonja]
UDC: 004.8:81'322(043.2)
COBISS: 77868291
Views: 286
Downloads: 66
Other data
Secondary language: English
Secondary title: Slovene sentence segmentation and punctuation using BERT-like models
Secondary abstract: The thesis focuses on the prediction of final punctuation in sentences. Punctuation prediction models are useful in speech recognition and potentially in correcting various texts. We want to predict where sentences end and whether they end with a period, a question mark or an exclamation mark. Our implementation predicts whether and which punctuation mark to place after each word. We used two Slovene variants of the BERT model, both successful in natural language processing tasks. The CroSloEngual BERT model has been pretrained on Slovene, Croatian and English corpora; we compared it to the SloBERTa model, trained exclusively on Slovene corpora. We fine-tuned these models on prepared data sets. The results show that SloBERTa predicts punctuation better than CroSloEngual BERT, and that predicting exclamation marks is difficult due to the low number of training instances.
Secondary keywords: deep neural networks;natural language processing;end-of-sentence punctuation prediction;RoBERTa model;BERT model;sentence segmentation;transformers;language model;computer and information science;diploma;computational linguistics;punctuation;artificial intelligence;computer science;university and higher education theses;
Type (COBISS): Bachelor thesis/paper
Study programme: 1000468
Thesis comment: University of Ljubljana, Faculty of Computer and Information Science
Pages: 32 pages
ID: 13394698