Knowledge graph-based document embedding enrichment

diploma thesis

Boshko Koloski (Author), Marko Robnik Šikonja (Mentor), Blaž Škrlj (Co-mentor)

Abstract

Structured and unstructured textual data require efficient representations for computation and manipulation. Many methods have been developed to represent text in numerical form. Some of these methods rely solely on statistical metrics, while others introduce the concept of word context. Structured textual data about concepts and entities is stored in knowledge graphs, for which various numerical representations have been developed. By using facts about concepts, semantics can be introduced into the representation of documents. We propose an approach that merges the numerical representation of texts with the numerical representation, induced from knowledge bases, of the entities that appear in those texts. We analyze the proposed method on two use cases. The results show that the use of external knowledge significantly improves the performance of machine learning models. We also show that the proposed method outperforms non-enriched representations.

Keywords

knowledge graphs;word embedding;knowledge graph embedding;natural language processing;computer and information science;diploma thesis;

Data

Language:	English
Year of publishing:	2020
Typology:	2.11 - Undergraduate Thesis
Organization:	UL FRI - Faculty of Computer and Information Science
Publisher:	[B. Koloski]
UDC:	004.85:81'322(043.2)
COBISS:	30743555

Other data

Secondary language:	Slovenian
Secondary title:	Obogatitev dokumentnih vložitev z grafi znanja
Secondary abstract:	Strukturirani in nestrukturirani tekstovni podatki zahtevajo učinkovito predstavitev za računanje in obdelavo. Za predstavitev besedila v številčni obliki je bilo razvitih veliko različnih metod. Del teh metod temelji zgolj na statističnih metrikah, nekatere pa uvedejo koncept konteksta besede. Strukturirani tekstovni podatki o konceptih in entitetah so shranjeni v grafih znanja, za katere so bile razvite številne numerične predstavitve. Z uporabo dejstev o konceptih lahko semantiko vnesemo v predstavitev dokumentov. Predlagamo pristop, ki združuje številčno predstavitev besedil in entitet, ki se pojavljajo v besedilih iz baz znanja. Predlagano metodo analiziramo s pomočjo dveh primerov uporabe. Rezultati kažejo, da uporaba zunanjega znanja bistveno izboljša uspešnost modelov strojnega učenja. Poleg tega pokažemo, da predlagana metoda presega neobogatene predstavitve.
Secondary keywords:	podatkovni grafi;vektorske vložitve besed;vložitve podatkovnih grafov;procesiranje naravnega jezika;računalništvo in informatika;univerzitetni študij;diplomske naloge;
Type (COBISS):	Bachelor thesis/paper
Study programme:	1000468
Thesis comment:	University of Ljubljana, Faculty of Computer and Information Science
Pages:	54 pp.
ID:	12033042