Master's thesis
    	
    Abstract
 
In this work we address the problem of extracting a list of people from an arbitrary website. To this end, we implement a web crawler that identifies potential subpages containing people and a data extractor that extracts data about people from an arbitrary web page.
We show that basic methods, such as comparing names against a list of names, do not reach acceptable accuracy. We show that analysing the structure of the list and transferring the discovered knowledge is the key method for improving the results to an acceptable level of accuracy. Using this approach, we improved the F1 score by 50 % on the development set and by 35 % on the hidden test set.
    Keywords
 
web;data extraction;automatic web data extraction;focused web crawlers;structured data;unstructured data;computer and information science;master's theses;
    Data
 
    
        
            | Language: | Slovenian | 
        
        
            | Year of publishing: | 2021 | 
            
        
        
            | Typology: | 2.09 - Master's Thesis | 
            
        
            | Organization: | UL FRI - Faculty of Computer and Information Science | 
        
            | Publisher: | [M. Koplan] | 
   
        
            | UDC: | 004.738.5(043.2) | 
   
        
        
            | COBISS: | 83603971   | 
        
        
  
        
    
    
    Other data
 
    
        
            | Secondary language: | English | 
        
        
            | Secondary title: | Automatic extraction of employee data from corporate websites | 
        
        
        
            | Secondary abstract: | In this work we address the problem of extracting lists of people from corporate websites. To this end, we implement a web crawler that identifies candidate subpages listing people and a data extractor designed to work on any website.
We show that basic methods, such as matching names against a list of names, do not reach acceptable accuracy. We show that analysing the structure of a list and transferring the knowledge discovered from it is crucial to reaching the required level of accuracy. Using this approach, we improved the F1 score by 50 % on the development set and by 35 % on the hidden test set. | 
        
        
            | Secondary keywords: | web;data extraction;automatic web data extraction;focused web crawlers;structured data;unstructured data;computer science;computer and information science;master's degree;websites;university and higher-education theses; | 
        
            
        
            | Type (COBISS): | Master's thesis/paper | 
        
        
            | Study programme: | 1000471 | 
        
            | Thesis comment: | University of Ljubljana, Faculty of Computer and Information Science | 
        
            | Pages: | 75 pp. | 
        
            | ID: | 13748127 |