Undergraduate thesis
Urban Knupleš (Author), Aleš Holobar (Mentor), Marko Ferme (Co-mentor)

Abstract

Unstructured documents capture information in forms and layouts that can vary from one instance to another, which can make information extraction difficult and costly. As a solution, the field of document intelligence has in recent years begun using neural language models, trained on datasets of documents, for document understanding. In this thesis, we use a pre-trained transformer-based neural language model to extract information from scanned store invoices. The model is fine-tuned on the SROIE dataset to extract four categories: store names, store addresses, dates, and total prices. Information extraction is performed through named entity recognition. For comparison, we run experiments with varied model hyperparameters. By changing the neural language model, we achieved a maximum classification accuracy of 96.7 % in our experiments.

Keywords

document intelligence;natural language processing;named entity recognition;language models;transformers;undergraduate theses;

Data

Language: Slovenian
Year of publishing:
Typology: 2.11 - Undergraduate Thesis
Organization: UM FERI - Faculty of Electrical Engineering and Computer Science
Publisher: [U. Knupleš]
UDC: 004.652.8(043.2)
COBISS: 95975171

Other data

Secondary language: English
Secondary title: Named entity recognition on unstructured documents using neural language models
Secondary abstract: The layouts and formats in which unstructured documents present information can differ from one document to another, which can make information extraction difficult and costly. In recent years, the field of document intelligence has therefore begun using neural language models, trained on datasets of documents, for document understanding. In this thesis, we adopt a pre-trained transformer-based neural language model to extract information from scanned store invoices. The model is fine-tuned on the SROIE dataset to extract four categories: store names, store addresses, dates, and total prices. For information extraction, we used named entity recognition to classify tokens into these four categories. For comparison, we conducted experiments with altered model hyperparameters. With the fine-tuned, altered neural language model, we achieved a maximum classification accuracy of 96.7 %.
Secondary keywords: Document intelligence;natural language processing;named entity recognition;language models;transformers;
Type (COBISS): Bachelor thesis/paper
Thesis comment: Univ. of Maribor, Faculty of Electrical Engineering and Computer Science, Computer Science and Information Technologies
Pages: IX, 35 pp.
ID: 13344285