Master's thesis
Martin Pezdir (Author), Ljupčo Todorovski (Mentor)

Abstract

The master's thesis presents a framework and models for web scraping product data from online stores, for automatically classifying these products into ECOICOP (European Classification of Individual Consumption according to Purpose) categories using machine learning, and for computing HICP (Harmonised Index of Consumer Prices) price indices. In the web scraping part, we describe the problems and challenges encountered in the automated transfer of data from the web. We also touch on legislation concerning web scraping. We implement a web scraper in the Python programming language that downloads data daily on approximately 30,000 products sold in the online stores of the two largest Slovenian retailers. In the second part, we introduce the field of machine learning, with an emphasis on converting text and categorical variables into numerical ones. We present and implement two methods for processing text data: the bag-of-words model and the word2vec algorithm. We describe the problems that arise from the specifics of our dataset and present solutions for dealing with them. Using machine learning, we build a hierarchical model that predicts which division, group, class, or subclass an individual product belongs to. In the last part, we use the official methodology to calculate price indices at the individual levels. Owing to data availability, we focus only on division 01 - Food and non-alcoholic beverages. We obtain comparable price indices, which nevertheless sometimes deviate from the official index because the official data sample in each aggregate is unknown.

Keywords

web scraping;natural language processing;machine learning;classification;inflation;

Data

Language: Slovenian
Year of publishing:
Typology: 2.09 - Master's Thesis
Organization: UL FU - Faculty of Administration
Publisher: [M. Pezdir]
UDC: 519.8
COBISS: 32570115

Other data

Secondary language: English
Secondary title: Calculation of price indices with machine learning for automatic product classification
Secondary abstract: The thesis presents a framework and models for web scraping of product data from online stores and for automatic classification of these products into ECOICOP (European Classification of Individual Consumption according to Purpose) categories using machine learning. From the classified products we can calculate an estimate of the official HICP (Harmonised Index of Consumer Prices). In the web scraping part, we describe the problems and challenges we face when using web crawlers for automated transfer of data from the web. We touch upon the legislation in the field of web scraping. We also implement a web scraper in Python, which daily transfers data on approximately 30,000 products sold by the two largest Slovenian retailers. In the second part, we give a basic introduction to the field of machine learning, with an emphasis on the conversion of text and categorical variables into numerical ones. We present and implement two methods for processing text data: the bag-of-words model and the word2vec algorithm. We describe the problems that arise due to the specifics of our dataset and present solutions to deal with them. We use machine learning to build a hierarchical model that predicts which ECOICOP categories an individual product belongs to. In the last part, we use the official methodology to calculate an estimate of price indices at different levels. Due to the availability of data, we focus only on division 01 - Food and non-alcoholic beverages. We obtain price indices comparable to the official ones, with deviations arising because the official data sample in each group of products is unknown.
Secondary keywords: Web scraping;natural language processing;machine learning;classification;inflation;
Type (COBISS): Master's thesis/paper
Study programme: 0
Embargo end date (OpenAIRE): 1970-01-01
Thesis comment: Univ. of Ljubljana, Faculty of Mathematics and Physics, Department of Mathematics, Financial Mathematics - 2nd cycle
Pages: XVII, 90 pp.
ID: 12074668