Diploma thesis
Peter Grlica (Author), Dejan Lavbič (Mentor)

Abstract

Web scraping techniques (Tehnike spletnega luščenja podatkov)

Keywords

PHP;Mink;data scraping;AJAX;anonymization;legislation;automation;open source;computer science;higher professional education study programme;computer and information science;diploma theses;

Details

Language: Slovenian
Year of publication:
Typology: 2.11 - Diploma thesis
Organization: UL FRI - Faculty of Computer and Information Science
Publisher: [P. Grlica]
UDC: 004.774.2(043.2)
COBISS: 9990484

Other data

Secondary language: English
Secondary title: Web scraping techniques
Secondary abstract: In this thesis we analysed different methods of accessing unstructured data on websites. Our main focus was on techniques for gathering information from the presentation layer (HTML parsing) using tools available in the open source community, as well as on the downsides of commercial data scrapers and scraping services. Because of our experience with the PHP programming language and the plethora of tools, libraries and products implemented in it, we focused on web scraping with the cURL library in combination with XPath. We also covered the use of "headless" browsers for advanced scraping of websites that rely extensively on AJAX requests, and Mink, a tool for automating website functionality testing. With the rise of and demand for web crawlers, many content providers try to block them by tracking bot access. We analysed the different uses of anonymization tools and the user identification techniques used on websites, and addressed the legislation concerning web scraping and the most widely known legal cases in this industry. Lastly, we discussed the positive and negative aspects of the implemented scraper, as well as possible upgrades and extensions of the implementation in terms of request parallelization and distributed control across different servers.
Secondary keywords: PHP;Mink;webscraping;AJAX;anonymization;legality;automatization;open source;computer science;computer and information science;diploma;
File type: application/pdf
Type of work (COBISS): Diploma thesis
Comment on the material: Univ. of Ljubljana, Faculty of Computer and Information Science
Pages: 67 pp.
ID: 24168159