Master's thesis
Matej Koplan (Author), Slavko Žitnik (Mentor)

Abstract

In this thesis we address the problem of extracting a list of people from an arbitrary website. To this end, we implement a web crawler that identifies candidate subpages listing people, and a data extractor that extracts data about people from an arbitrary web page. We show that basic methods, such as matching a name against a list of names, do not reach acceptable accuracy. We show that analysing the structure of the list and transferring the discovered knowledge is the key method for improving the results to the point where accuracy becomes acceptable. With this approach we improved the F1 score by 50 % on the development set and by 35 % on the hidden test set.

Keywords

web;data extraction;automatic web data extraction;focused web crawlers;structured data;unstructured data;computer and information science;master's theses;

Data

Language: Slovenian
Year of publishing:
Typology: 2.09 - Master's Thesis
Organization: UL FRI - Faculty of Computer and Information Science
Publisher: [M. Koplan]
UDC: 004.738.5(043.2)
COBISS: 83603971

Other data

Secondary language: English
Secondary title: Automatic extraction of employee data from corporate websites
Secondary abstract: In this work we tackle the problem of extracting lists of people from corporate websites. For this purpose we implement a web crawler that identifies candidate subpages listing people and a data extractor designed to work on any website. We show that basic methods, such as matching names against a list of names, do not reach acceptable accuracy. We show that analysing the structure of a list and transferring the discovered knowledge is crucial to reaching the required level of accuracy. Using this approach we improved the F1 score of our final results by 50 % on the development set and by 35 % on the hidden test set.
Secondary keywords: web;data extraction;automatic web data extraction;focused web crawlers;structured data;unstructured data;computer science;computer and information science;master's degree;websites;computer science;university and higher-education theses;
Type (COBISS): Master's thesis/paper
Study programme: 1000471
Thesis comment: University of Ljubljana, Faculty of Computer and Information Science
Pages: 75
ID: 13748127