master's thesis
Abstract
Hierarchical clustering is an unsupervised data mining technique that infers a set of nested, hierarchically organised clusters. Even slight perturbations in the data can change the clustering structure. Ideally, we should be interested only in the stable part of the clustering hierarchy, so it is essential to assess the stability of the nodes in the hierarchy. In this thesis, we review approaches for determining the stability and statistical significance of clusters. While all the reviewed methods use resampling, their results can differ substantially because of differences in implementation details and stability scoring. The approach called pvclust has recently been the most widely used in practical applications. Its R implementation, however, is slow and visualises the results poorly. We have implemented pvclust in Python, yielding an implementation that is almost an order of magnitude faster than the version in R. Our implementation is currently the only open-source Python implementation of stability analysis of hierarchical clustering. To visualise the results and enable interactive exploratory data analysis, we also incorporated our implementation into the Orange data mining toolbox.
Keywords
hierarchical clustering;stability;dendrogram;unsupervised learning;computer science;computer and information science;master's degree;
Data
Language: |
English |
Year of publishing: |
2020 |
Typology: |
2.09 - Master's Thesis |
Organization: |
UL FRI - Faculty of Computer and Information Science |
Publisher: |
[A. Turanjanin] |
UDC: |
004.8(043.2) |
COBISS: |
32934403 |
Views: |
1115 |
Downloads: |
212 |
Other data
Secondary language: |
Slovenian |
Secondary title: |
Stabilnost hierarhičnega razvrščanja v skupine |
Secondary abstract: |
Hierarchical clustering is an unsupervised learning method that searches for nested, hierarchically organised groups in data. Its weakness is sensitivity to small perturbations in the data, which can cause large changes in the clustering structure. Ideally, we are interested only in the stable part of the hierarchy, for which we must assess the stability of the nodes. In this thesis, we reviewed approaches for determining the stability and statistical significance of clusters. Although all the reviewed methods use resampling, their results can differ substantially because of details in the implementation and in the computation of stability. The method called pvclust has recently been the most frequently used in practical applications. Its implementation in R is slow, and its visualisation of the results is poor. We implemented the pvclust method in Python, and our implementation is almost an order of magnitude faster than the version in R. Our implementation is currently the only open-source Python implementation for stability analysis of hierarchical clustering. To visualise the results and enable interactive exploratory data analysis, we incorporated the implementation into the Orange data mining toolbox. |
Secondary keywords: |
hierarchical clustering;stability;dendrogram;unsupervised learning;computer science;computer and information science;master's theses; |
Type (COBISS): |
Master's thesis/paper |
Study programme: |
1000471 |
Embargo end date (OpenAIRE): |
1970-01-01 |
Thesis comment: |
Univ. v Ljubljani, Fak. za računalništvo in informatiko |
Pages: |
VI, 49 pp. |
ID: |
12050939 |