Odprta ekstrakcija informacij za slovenski jezik

undergraduate thesis

Miha Bogataj (Author), Slavko Žitnik (Mentor)

Abstract

Open information extraction is a natural language processing task that extracts possible dependencies from individual sentences. Each dependency is a semantic triple: the first element is the subject being queried, the second is the relation that describes how the subject relates to the third element, and the third is the object. The open information extraction system for Slovenian is rule-based and consists of a preprocessor and an extractor. The preprocessor processes the input text with the CLASSLA system, which grammatically analyzes each sentence, lemmatizes it, and builds a semantic tree. The extractor applies rules to find the relations in a sentence. These rules are more complex than for English because Slovenian word order is freer. Slovenian also has several declensions (case inflections), which allow the subject and object to be determined more precisely. The resulting extractions can be searched in two ways: sentence search and parameter completion. Sentence search requires all three parameters of the semantic triple to be filled in and returns a list of sentences that match the given triple. Parameter completion requires two filled-in parameters, one of which must be the relation, and returns a list of possible values for the missing parameter.
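
The two search modes described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation: the `Extraction` type, the function names `search_sentences` and `complete_parameter`, and the sample triples are all hypothetical, and matching is assumed to be exact string equality on lemmatized values.

```python
from typing import List, NamedTuple, Optional


class Extraction(NamedTuple):
    """One semantic triple plus the sentence it came from (hypothetical type)."""
    subject: str
    relation: str
    obj: str
    sentence: str


# Hand-written sample extractions for illustration only; not data from the thesis.
EXTRACTIONS = [
    Extraction("Ljubljana", "biti", "mesto", "Ljubljana je mesto."),
    Extraction("Maribor", "biti", "mesto", "Maribor je mesto."),
    Extraction("Sava", "biti", "reka", "Sava je reka."),
]


def search_sentences(subject: str, relation: str, obj: str,
                     extractions: List[Extraction] = EXTRACTIONS) -> List[str]:
    """Sentence search: all three parameters are required."""
    return [e.sentence for e in extractions
            if (e.subject, e.relation, e.obj) == (subject, relation, obj)]


def complete_parameter(relation: str,
                       subject: Optional[str] = None,
                       obj: Optional[str] = None,
                       extractions: List[Extraction] = EXTRACTIONS) -> List[str]:
    """Parameter completion: the relation plus exactly one of subject or object."""
    if (subject is None) == (obj is None):
        raise ValueError("provide the relation and exactly one of subject or obj")
    if subject is not None:
        # Subject and relation are known, so collect candidate objects.
        values = {e.obj for e in extractions
                  if e.relation == relation and e.subject == subject}
    else:
        # Object and relation are known, so collect candidate subjects.
        values = {e.subject for e in extractions
                  if e.relation == relation and e.obj == obj}
    return sorted(values)
```

For example, `complete_parameter("biti", obj="mesto")` lists every subject recorded as being a "mesto" (city) in the sample data, while `search_sentences("Sava", "biti", "reka")` returns the sentences that support that exact triple.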

Keywords

extraction;information;Slovenian language;computer science;university studies;undergraduate theses;

Data

Language:	Slovenian
Year of publishing:	2022
Typology:	2.11 - Undergraduate Thesis
Organization:	UL FRI - Faculty of Computer and Information Science
Publisher:	[M. Bogataj]
UDC:	004:81'322(043.2)
COBISS:	105616387

Other data

Secondary language:	English
Secondary title:	Open information extraction for Slovenian language
Secondary abstract:	Open information extraction is a natural language processing task that extracts possible dependencies from individual sentences. Each dependency is a semantic triple: the first element is the subject being queried, the second is the relation that describes how the subject relates to the third element, and the third is the object. The open information extraction system for Slovenian is rule-based and consists of a preprocessor and an extractor. The preprocessor processes the input text with the CLASSLA system, which grammatically analyzes each sentence, lemmatizes it, and builds a semantic tree. The extractor applies rules to find the relations in a sentence. These rules are more complex than for English because Slovenian word order is freer. Slovenian also has several declensions (case inflections), which allow the subject and object to be determined more precisely. The resulting extractions can be searched in two ways: sentence search and parameter completion. Sentence search requires all three parameters of the semantic triple to be filled in and returns a list of sentences that match the given triple. Parameter completion requires two filled-in parameters, one of which must be the relation, and returns a list of possible values for the missing parameter.
Secondary keywords:	extraction;information;Slovenian language;computer science;diploma;Natural language processing (computer science);Computational linguistics;Computer science;University and higher education theses;
Type (COBISS):	Bachelor thesis/paper
Study programme:	1000468
Thesis comment:	University of Ljubljana, Faculty of Computer and Information Science
Pages:	58 pp.
ID:	15098307