Undergraduate thesis
Jože Fartek (Author), Milan Ojsteršek (Mentor)

Abstract

Because relational databases that store large volumes of hashes computed from texts and generate similar-content detector reports are difficult to scale horizontally, we investigated whether NoSQL databases could be used for this purpose. We tested several databases and selected the most suitable one. We also implemented several algorithms suited to detecting similarity in paraphrased texts; they build hashes from texts using normalized n-grams. We compared these algorithms with the hash generation algorithm used at the University of Maribor to detect similar documents. After selecting the most suitable NoSQL database and hash generation algorithm, we implemented a prototype of a distributed system for identifying similar documents and generating similar-content detector reports.

Keywords

distributed processing;plagiarism detector;detection;undergraduate theses;

Data

Language: Slovenian
Year of publishing:
Typology: 2.11 - Undergraduate Thesis
Organization: UM FERI - Faculty of Electrical Engineering and Computer Science
Publisher: J. Fartek
UDC: 004.4'415:7.061(043.2)
COBISS: 21849878

Other data

Secondary language: English
Secondary title: Distributed generation of plagiarism detection reports
Secondary abstract: Because relational databases that store large quantities of hashes calculated from documents and generate plagiarism detection reports are difficult to scale horizontally, we explored the possibility of using NoSQL databases for this purpose. We tested several NoSQL databases and selected the most appropriate one. We also implemented several algorithms suitable for finding similarities in paraphrased documents; these algorithms generate hashes from documents using normalized n-grams. We compared them with the hash generation algorithm used at the University of Maribor to detect similar documents. After selecting the most suitable NoSQL database and hash generation algorithm, we implemented a prototype of a distributed system for identifying similar documents and generating similar-content detector reports.
Secondary keywords: distributed processing;MapReduce;NoSQL;text matching;
URN: URN:SI:UM:
Type (COBISS): Bachelor thesis/paper
Thesis comment: University of Maribor, Faculty of Electrical Engineering and Computer Science, Computer Science and Information Technologies
Pages: IX, 56 f.
ID: 10955664