Undergraduate thesis
Peter Grlica (Author), Dejan Lavbič (Mentor)

Abstract

Tehnike spletnega luščenja podatkov (Web scraping techniques)

Keywords

PHP;Mink;web scraping;AJAX;anonymization;legislation;automation;open source;computer science;professional higher education study;computer and information science;diploma theses;

Data

Language: Slovenian
Year of publishing:
Typology: 2.11 - Undergraduate Thesis
Organization: UL FRI - Faculty of Computer and Information Science
Publisher: [P. Grlica]
UDC: 004.774.2(043.2)
COBISS: 9990484

Other data

Secondary language: English
Secondary title: Web scraping techniques
Secondary abstract: This thesis analyses different approaches to accessing unstructured data on websites. The main focus is on techniques for gathering information from the presentation layer (HTML parsing) using tools available in the open source community, along with the downsides of commercial data scrapers and scraping services. Given our experience with the PHP programming language and the many tools, libraries and products implemented in it, we concentrated on web scraping with the cURL library in combination with XPath. We also examined "headless" browsers for advanced scraping of websites that use AJAX requests extensively, and Mink, a tool for automating website functionality testing. With the rise of and growing demand for web crawlers, many content providers try to block them by tracking bot access. We analysed the use of anonymization tools and the user identification techniques employed on websites, and reviewed the legislation concerning web scraping and the best-known legal cases in this field. Finally, we discuss the strengths and weaknesses of the implemented scraper, as well as possible upgrades and extensions to the implementation in terms of request parallelization and distributed control across multiple servers.
Secondary keywords: PHP;Mink;web scraping;AJAX;anonymization;legality;automation;open source;computer science;computer and information science;diploma;
File type: application/pdf
Type (COBISS): Bachelor thesis/paper
Thesis comment: Univ. of Ljubljana, Faculty of Computer and Information Science
Pages: 67 pp.
ID: 24168159