Abstract
Razvrščanje podatkov v gruče je slabo pogojeni problem in dokazano je, da algoritem,
ki bi izpolnjeval vse predpostavke dobrega razvrščanja, ne obstaja. To je glavni razlog
za obstoj velikega števila algoritmov za razvrščanje, ki temeljijo na raznovrstnih teoretičnih
osnovah – med njimi je tudi znan algoritem Kohonenove samo-organizirajoče
mreže (SOM). Na žalost nam naučena mreža SOM ne ponudi eksplicitno izražene
strukture gruč v podatkih, zato navadno posegamo uporabimo dodaten korak, na katerem
združujemo posamezne enote mreže v gruče. V disertaciji predstavljamo doprinos
k dvonivojskemu razvrščanju z mrežo SOM, pri čemer uporabljamo principe zakona
gravitacije. Predlagan algoritem za gravitacijsko razvrščanje samo-organizirajoče mreže
(gSOM) je sposoben odkriti gruče zapletene in ne zgolj hipersferične oblike. Poleg
tega algoritem gSOM sam določi število gruč v podatkih. Opravili smo primerjavo z
nekaterimi drugimi tehnikami razvrščanja na umetnih in realnih podatkih. Izkaže se,
da gSOM doseže obetavne rezultate, še posebej na podatkih o izraženosti genov.
Algoritem, ki bi znal rešiti vse probleme razvrščanja ne obstaja, zato je koristno analizirati
podatke skozi večkratno razvrščanje. Pri tem nastane množica razvrstitev in
tvorijo ansambel razvrstitev. Metode ansamblov za razvrščanje so se pojavile nedavno
kot učinkovit pristop k stabilizaciji in izboljšanju delovanja enostavnih algoritmov za
razvrščanje. Razvrščanje z ansambli je v osnovi sestavljeno iz dveh korakov: gradnja
ansambla razvrstitev z enostavnimi metodami in združevanje dobljenih rešitev v sporazumno
razvrstitev podatkov. Da bi olajšali korak združevanja v sporazum, je bil predlagan
postopek uteževanja razvrstitev v ansamblu, ki skuša ovrednotiti pomembnost
posameznih članov ansambla. Eden od načinov za analizo pomembnosti razvrstitev
(PRA) je uporaba notranjih kazalcev veljavnosti razvrstitev. Na tem področju smo napravili
dva prispevka: najprej predlagamo nov notranji ocenjevalni kazalec, imenovan
DNs, ki razširja Dunnov kazalec in je osnovan na iskanju najkrajših poti v Gabrielovem grafu nad podatki; drugi prispevek je povezan z nadgradnjo obstoječega pristopa uteženega
ansambla z dodatnim korakom redukcije, ki sledi koraku ocenjevanja razvrstitev
v ansamblu. Razvit postopek analize pomembnosti razvrstitev z redukcijo (PRAr) se
obnese zadovoljivo, ko ga vključimo v tri funkcije za iskanje sporazumne razvrstitve,
pri čemer vse funkcije temeljijo na principu kopičenja dokazov.
V disertaciji se dotikamo vseh glavnih področij razvrščanja podatkov: ustvarjanje
podatkov, analiza podatkov z enostavnimi algoritmi za razvrščanje, ocenjevanje razvrstitev
z notranjimi in zunanjimi kazalci veljavnosti ter razvrščanje z ansambli s poudarkom
na uteženih različicah. Vse predlagane doprinose smo primerjali s trenutno
aktualnimi metodami na podatkih iz različnih problemskih domen. Rezultati kažejo
na uporabnost predlaganih metod v kontekstu strojnega učenja.
Keywords
cluster analysis;unsupervised learning;weighted cluster ensemble;cluster validation;synthetic data generation;computer and information science;doctoral dissertations;
Data
Language: |
English |
Year of publishing: |
2016 |
Typology: |
2.08 - Doctoral Dissertation |
Organization: |
UL FRI - Faculty of Computer and Information Science |
Publisher: |
[N. Ilc] |
UDC: |
004.6(043.3) |
COBISS: |
1537246403
|
Views: |
1483 |
Downloads: |
849 |
Average score: |
0 (0 votes) |
Metadata: |
|
Other data
Secondary language: |
Slovenian |
Secondary title: |
Clustering based on weighted ensemble |
Secondary abstract: |
The clustering is an ill-posed problem and it has been proven that there is no algorithm
that would satisfy all the assumptions about good clustering. This is why numerous
clustering algorithms exist, based on various theories and approaches, one of
them being the well-known Kohonen’s self-organizing map (SOM). Unfortunately,
after training the SOM there is no explicitly obtained information about clusters in
the underlying data, so another technique for grouping SOM units has to be applied
afterwards. In the thesis, a contribution towards a two-level clustering of the SOM
is presented, employing principles of Gravitational Law. The proposed algorithm for
gravitational clustering of the SOM (gSOM) is capable of discovering complex cluster
shapes, not only limited to the spherical ones, and is able to automatically determine
the number of clusters. Experimental comparison with other clustering techniques is
conducted on synthetic and real-world data. We show that gSOM achieves promising
results especially on gene-expression data.
As there is no clustering algorithm that can solve all the problems, it turns out as
very beneficial to analyse the data using multiple partitions of them – an ensemble of
partitions. Cluster-ensemble methods have emerged recently as an effective approach
to stabilize and boost the performance of the single-clustering algorithms. Basically,
data clustering with an ensemble involves two steps: generation of the ensemble with
single-clustering methods and the combination of the obtained solutions to produce a
final consensus partition of the data. To alleviate the consensus step the weighted cluster
ensemble was proposed that tries to assess the relevance of ensemble members. One
way to achieve this is to employ internal cluster validity indices to perform partition
relevance analysis (PRA). Our contribution here is two-fold: first, we propose a novel
cluster validity index DNs that extends the Dunn’s index and is based on the shortest
paths between the data points considering the Gabriel graph on the data; second, we propose an enhancement to the weighted cluster ensemble approach by introducing the
reduction step after the assessment of the ensemble partitions is done. The developed
partition relevance analysis with the reduction step (PRAr) yields promising results
when plugged in the three consensus functions, based on the evidence accumulation
principle.
In the thesis we address all the major stages of data clustering: data generation, data
analysis using single-clustering algorithms, cluster validity using internal end external
indices, and finally the cluster ensemble approach with the focus on the weighted variants.
All the contributions are compared to the state-of-art methods using datasets
from various problem domains. Results are positive and encourage the inclusion of
the proposed algorithms in the machine-learning practitioner’s toolbox. |
Secondary keywords: |
razvrščanje v gruče;nenadzorovano učenje;metode utežnega ansambla;ocenjevanje razvrstitev;generator umetnih podatkov;računalništvo in informatika;doktorske disertacije;Podatki;Disertacije;Razvrščanje; |
File type: |
application/pdf |
Type (COBISS): |
Doctoral dissertation |
Study programme: |
1000474 |
Embargo end date (OpenAIRE): |
1970-01-01 |
Thesis comment: |
Univ. v Ljubljani, Fak. za računalništvo in informatiko |
Pages: |
IX, 212 str. |
ID: |
9214066 |