Multilevel complex systems approaches to computational linguistics

doctoral dissertation

Kristina Ban (Author), Zoran Levnajić (Mentor), Biljana Mileva-Boshkoska (Co-mentor)

Abstract

Complex systems are omnipresent in nature, society and human culture. The last few decades have seen growing interest in their study, particularly through graph-theoretic methods. By identifying a system's units as nodes and modelling the interactions between units as links, the study of complex networks has spread to a number of disciplines, including sociology, biology and linguistics, to mention just a few. The research in this doctoral dissertation falls within this context. The core of this doctoral work is a data-driven multilevel analysis of major human languages, carried out in two stages. First, we looked at the speed of growth of Wikipedias in 26 different languages over a span of 15 years. This involved creating and analysing a dataset of 14962 articles, each of which exists in all 26 languages. We found six well-defined clusters of Wikipedias that share common growth patterns, with their make-up robust against the method used to determine them. Interestingly, the identified clusters show little correlation with the respective language families. Rather, our results suggest that the growth of Wikipedias is primarily governed by an intricate set of other factors, from culture to information literacy. Second, to approach human languages at another, independent level, we gathered a dataset comprising a list of syllables and a list of syllabified words in 10 different languages: English, Dutch, German, Russian, Slovenian, Croatian, French, Spanish, Latin and Basque. These datasets were obtained from recognized repositories for each language and benchmarked in the same way. Syllable networks were created by linking pairs of syllables that jointly compose at least one word. We then carried out a systematic network analysis, relying on both standard network analysis methods and more recent techniques, such as k-core analysis and graphlet statistics.
The research revealed striking similarities between the architectures of syllable networks that belong to the same language family, along with expected differences between the families. Indeed, the structure of the syllable networks was found to quantify well the linguistic similarities among these 10 languages, as known from classical linguistics. Most interestingly, we found that the Basque language, whose classification is still unknown today, bears a strong resemblance to Latin, at least as far as the syllable network representation is concerned. Earlier stages of this doctoral work involved comparing the performance of network alignment algorithms, which are used in bioinformatics for studying protein networks. Several alignment algorithms were compared by scoring their performance on standard protein datasets. Three algorithms, HUBALIGN, L-GRAAL and NATALIE, were found to regularly produce the most topologically and biologically coherent alignments. Due to a change of doctoral adviser, this research topic was abandoned in favour of language/syllable networks. In sum, this doctoral work involved two distinct directions of research in network science: one related to developing the methodology of network analysis (alignment algorithms), and the other devoted to extracting new information from specifically designed datasets (syllable networks). The original contribution of this work to science therefore includes both theory and methodology. Future research avenues include advancing along both directions, the most interesting being the application of network alignment methods to syllable datasets, which could yield a more precise quantification of the structural differences among syllable networks.
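
The syllable-network construction and the k-core analysis mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the code used in the dissertation. It assumes that any two distinct syllables occurring in the same word are linked (the abstract does not say whether only adjacent syllables count), and the function names `build_syllable_network` and `core_numbers`, as well as the toy word list, are hypothetical and invented for illustration.

```python
from itertools import combinations


def build_syllable_network(syllabified_words):
    """Return an undirected graph as {syllable: set of neighbouring syllables}.

    Assumption: two syllables are linked if they occur together in at
    least one word, regardless of their position within the word.
    """
    graph = {}
    for syllables in syllabified_words:
        unique = set(syllables)  # ignore repeats within one word
        for s in unique:
            graph.setdefault(s, set())
        for a, b in combinations(sorted(unique), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def core_numbers(graph):
    """Core number of every node, found by peeling minimum-degree nodes.

    A node has core number k if it belongs to the k-core (the largest
    subgraph in which every node has degree >= k) but not the (k+1)-core.
    """
    degree = {node: len(nbrs) for node, nbrs in graph.items()}
    remaining = set(graph)
    core = {}
    k = 0
    while remaining:
        # Remove a node of currently smallest degree (ties broken by name).
        node = min(remaining, key=lambda n: (degree[n], n))
        k = max(k, degree[node])
        core[node] = k
        remaining.remove(node)
        for nbr in graph[node]:
            if nbr in remaining:
                degree[nbr] -= 1
    return core


# Toy input, invented for illustration only (not taken from the thesis data).
toy_words = [
    ["ba", "na", "na"],
    ["ba", "by"],
    ["na", "no"],
    ["ba", "na", "no"],
]
network = build_syllable_network(toy_words)
cores = core_numbers(network)
```

In this toy example the syllables "ba", "na" and "no" form a triangle and therefore share core number 2, while "by", attached only to "ba", has core number 1. The simple peeling loop is quadratic in the number of nodes, which is fine for a sketch; for networks of realistic size a dedicated graph library would be the more practical choice.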

Keywords

computational statistics;biostatistics;bioinformatics;machine learning;computational linguistics;

Details

Language:	English
Year of publication:	2018
Typology:	2.08 - Doctoral dissertation
Organization:	FIŠ - Faculty of Information Studies in Novo mesto
Publisher:	[K. Ban]
UDC:	004.8:519.765:81'322(043.3)
COBISS:	297952768
Views:	3952
Downloads:	139

Other data

Secondary language:	Slovenian
Secondary title:	Večnivojski kompleksni pristopi k računalniški lingvistiki
Secondary abstract:	Complex systems are omnipresent in nature, in society and in human culture. Over the last few decades, interest in their study has grown, particularly through the use of graph-theoretic methods. By representing a system's units as nodes and modelling the interactions between units as links, the study of complex networks has spread to numerous disciplines, including sociology, biology and linguistics, to name only a few. The research in this doctoral dissertation belongs to this context. The core of this doctoral work is a data-driven multilevel analysis of the major world languages, carried out in two stages. First, we looked at the growth rate of Wikipedias in 26 different languages over a period of 15 years. This involved creating and analysing a dataset of 14962 articles, each of which exists in all 26 languages. We found six clearly defined clusters of Wikipedias that share common growth patterns, and their composition is surprisingly robust with respect to the method used to determine them. Interestingly, the identified clusters correlate very little with the language families of the 26 languages. Rather, our results suggest that the growth of Wikipedias is primarily determined by an intricate set of other factors, from culture to information literacy. In the second stage, we approached the world languages at a new level: we collected datasets with a list of syllables and a list of syllabified words in ten different languages: English, Dutch, German, Russian, Slovenian, Croatian, French, Spanish, Latin and Basque. These datasets were obtained from recognized data repositories for each language and harmonized accordingly. Syllable networks were created such that pairs of syllables that together compose at least one word are represented as a linked pair of nodes. We then carried out a systematic network analysis that relied on standard network analysis methods as well as more recent techniques, such as k-core analysis and graphlet statistics.
The research revealed striking similarities between the architectures of syllable networks belonging to the same language family, along with expected differences between different language families. Most interestingly, the syllable structure of the Basque language, whose classification remains unknown to this day, strongly resembles that of Latin.
Secondary keywords:	computational statistics;biostatistics;bioinformatics;machine learning;computational linguistics;
Type of work (COBISS):	Doctoral thesis
Comment on the material:	Faculty of Information Studies in Novo mesto
Pages:	XX, 135 pp.
ID:	10994936