Odkrivanje povezanih računov v veliki množici podatkov (Detection of linked accounts in a large data set)

Master's thesis

Benjamin Novak (Author), Aleksander Sadikov (Mentor)

Abstract

We live in an era in which we leave a trail of our data whenever we use the World Wide Web. Companies that store and analyze such data face challenges of time and space complexity because of the sheer volume involved. In this master's thesis we tried to address one such challenge: finding the pairs of most similar accounts in large data sets. We analyzed the time efficiency and computational performance of methods for finding pairs of instances with a high degree of similarity. The experiments were carried out on two data sets. We present a way of transforming the data and representing it as a sparse matrix. We then used this representation in experiments in which we searched for the pairs of accounts with the highest cosine similarity using the exact All Pairs method, the LSH method, and bisecting clustering with leaders (Bisecting K-Means). Our goal was to assess which of these methods gives the best results in practice. We found that the All Pairs method is unsuitable for practical use because of its time inefficiency, while the performance of the approximation methods depends on the choice of parameters. The LSH method found links above 80% similarity in a shorter time, whereas in terms of time efficiency bisecting clustering with leaders is more suitable for lower similarity thresholds.
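The abstract does not show any implementation, so the following is only a minimal sketch of two of the ideas it names: an exact all-pairs search by cosine similarity over sparse vectors, and an approximate search that verifies only candidate pairs produced by hashing. Everything in it is an assumption made for illustration: the toy `accounts` data, the feature names, the function names, the dict-based sparse representation, and the choice of random-hyperplane signatures with banding as the LSH scheme (one common LSH family for cosine similarity, not necessarily the one used in the thesis). Bisecting clustering is not sketched here.

```python
import math
import random
from collections import defaultdict
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as {feature: weight} dicts."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    dot = sum(w * v[f] for f, w in u.items() if f in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def all_pairs(accounts, threshold):
    """Exact baseline: every pair is compared, i.e. O(n^2) similarity computations."""
    result = []
    for a, b in combinations(sorted(accounts), 2):
        s = cosine(accounts[a], accounts[b])
        if s >= threshold:
            result.append((a, b, s))
    return result


def signature(vec, n_bits, seed=0):
    """Random-hyperplane signature: one sign bit per pseudo-random hyperplane."""
    bits = []
    for b in range(n_bits):
        proj = 0.0
        for f, w in vec.items():
            # Hyperplane coefficient for (bit, feature), derived from a string seed
            # so that no dense projection matrix has to be stored.
            proj += w * random.Random(f"{seed}:{b}:{f}").gauss(0.0, 1.0)
        bits.append(proj >= 0.0)
    return tuple(bits)


def lsh_pairs(accounts, threshold, n_bits=16, band_size=4, seed=0):
    """Approximate search: only pairs sharing at least one signature band are verified."""
    buckets = defaultdict(set)
    for name, vec in accounts.items():
        sig = signature(vec, n_bits, seed)
        for start in range(0, n_bits, band_size):
            buckets[(start, sig[start:start + band_size])].add(name)
    candidates = set()
    for names in buckets.values():
        candidates.update(combinations(sorted(names), 2))
    result = []
    for a, b in sorted(candidates):
        s = cosine(accounts[a], accounts[b])  # exact check removes false positives
        if s >= threshold:
            result.append((a, b, s))
    return result


# Hypothetical toy data: each account is a sparse vector of feature weights.
accounts = {
    "acc1": {"ip:10.0.0.1": 3.0, "device:A": 1.0},
    "acc2": {"ip:10.0.0.1": 3.0, "device:A": 1.0, "device:B": 0.5},
    "acc3": {"ip:10.0.0.9": 2.0},
    "acc4": {"ip:10.0.0.1": 6.0, "device:A": 2.0},  # scaled copy of acc1
}
exact = all_pairs(accounts, threshold=0.8)
approx = lsh_pairs(accounts, threshold=0.8)
```

In this sketch the approximate result is always a subset of the exact one, because every candidate is re-checked with the exact cosine similarity; pairs can only be missed, never wrongly reported. The parameters `n_bits` and `band_size` control that trade-off, which mirrors the abstract's remark that the performance of the approximation methods depends on the choice of parameters.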

Keywords

clustering;approximation methods;time efficiency;computer science;computer and information science;master's theses;

Data

Language:	Slovenian
Year of publishing:	2019
Typology:	2.09 - Master's Thesis
Organization:	UL FRI - Faculty of Computer and Information Science
Publisher:	[B. Novak]
UDC:	004(043.2)
COBISS:	1538377155

Other data

Secondary language:	English
Secondary title:	Detection of linked accounts in a large data set
Secondary abstract:	We live in an era in which we leave traces of our personal data whenever we use the World Wide Web. Companies that store and analyze such data face challenges of time and space complexity because of the sheer volume of data. In our master's thesis, we tried to address one of these challenges by identifying linked accounts in large data sets. We analyzed the time efficiency and computational performance of methods for finding pairs of highly similar accounts. The experiments were carried out on two data sets. We present how the data were transformed and represented as a sparse matrix. We then searched for pairs of accounts with cosine similarity above a threshold using the exact All Pairs method, Locality-Sensitive Hashing (LSH), and Bisecting K-Means. Our goal was to evaluate which of these methods yields the best results with acceptable processing time. We found that the All Pairs method is inadequate for practical use because of its time inefficiency, while the performance of the approximation methods depends on the choice of parameters. The LSH method found pairs with similarity over 80% in a shorter time, whereas Bisecting K-Means was more time-efficient for lower similarity thresholds.
Secondary keywords:	clustering;approximation methods;time complexity;similarity measure;computer science;computer and information science;master's degree;
Type (COBISS):	Master's thesis/paper
Study programme:	1000471
Thesis comment:	Univ. of Ljubljana, Faculty of Computer and Information Science
Pages:	60 pp.
ID:	11236756