Undergraduate thesis
Abstract
Because relational databases that store large numbers of text hashes and generate reports for a similar-content detector are difficult to scale horizontally, we investigated whether NoSQL databases could be used for this purpose. We tested several databases and selected the most suitable one. We also implemented several algorithms suited to detecting similarity in paraphrased texts; they generate hashes from texts using normalized n-grams. We compared these algorithms with the hash generation algorithm used at the University of Maribor for detecting similar documents. After selecting the most suitable NoSQL database and hash generation algorithm, we implemented a prototype of a distributed system for identifying similar documents and generating similar-content detector reports.
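The idea of generating hashes from normalized n-grams can be illustrated with a minimal sketch. It is not the thesis's algorithm or the one used at the University of Maribor; the abstract does not specify them. The function names (`normalize`, `ngram_hashes`, `similarity`), the normalization rule (lowercasing and dropping punctuation), the word-level n-gram size of 3, the use of SHA-256, and the Jaccard overlap measure are all assumptions made for illustration.

```python
import hashlib
import re


def normalize(text):
    # Assumed normalization: lowercase the text and keep only word tokens,
    # so case and punctuation differences do not change the n-grams.
    return re.findall(r"\w+", text.lower())


def ngram_hashes(text, n=3):
    # Build word-level n-grams and hash each one to a short hex digest.
    words = normalize(text)
    grams = (" ".join(words[i:i + n]) for i in range(len(words) - n + 1))
    return {hashlib.sha256(g.encode("utf-8")).hexdigest()[:16] for g in grams}


def similarity(a, b, n=3):
    # Jaccard overlap of the two hash sets (an assumed similarity measure).
    ha, hb = ngram_hashes(a, n), ngram_hashes(b, n)
    if not ha or not hb:
        return 0.0
    return len(ha & hb) / len(ha | hb)
```

Because each document reduces to a set of independent hash values, such sets could be stored as keys in a horizontally partitioned store and compared by set intersection, which is the kind of workload the abstract associates with NoSQL databases.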
Keywords
distributed processing;plagiarism detector;detection;diploma theses;
Data
Language: |
Slovenian |
Year of publishing: |
2018 |
Typology: |
2.11 - Undergraduate Thesis |
Organization: |
UM FERI - Faculty of Electrical Engineering and Computer Science |
Publisher: |
J. Fartek |
UDC: |
004.4'415:7.061(043.2) |
COBISS: |
21849878 |
Views: |
640 |
Downloads: |
93 |
Other data
Secondary language: |
English |
Secondary title: |
Distributed generation of plagiarism detection reports |
Secondary abstract: |
Because relational databases used for storing large quantities of hashes calculated from documents and for generating plagiarism detection reports are difficult to scale horizontally, we explored the possibility of using NoSQL databases for this purpose. We tested several NoSQL databases and selected the most appropriate one. Furthermore, we implemented several algorithms that are suitable for finding similarities in paraphrased documents and that are based on generating hashes from documents using normalized n-grams. These algorithms were compared with a hash generation algorithm used at the University of Maribor to detect similar documents. After selecting the most suitable NoSQL database and hash generation algorithm, we implemented a prototype of a distributed computer system for identifying similar documents and generating similar-content detector reports. |
Secondary keywords: |
distributed processing;MapReduce;NoSQL;text matching; |
URN: |
URN:SI:UM: |
Type (COBISS): |
Bachelor thesis/paper |
Thesis comment: |
Univ. v Mariboru, Fak. za elektrotehniko, računalništvo in informatiko, Računalništvo in informacijske tehnologije |
Pages: |
IX, 56 f. |
ID: |
10955664 |