Napovedovanje vrednosti onesnaževal v zraku s tehnikami strojnega učenja

magistrsko delo

Jernej Katanec (Author), Matej Guid (Mentor), Jana Faganeli Pucer (Co-mentor)

Abstract

V današnjem času onesnaženost zraka predstavlja enega izmed največjih problemov na svetu. V Sloveniji v zadnjih letih beležimo povišane vrednosti dveh onesnaževal, delcev $PM_{10}$ in ozona ($O_3$). Daljša izpostavljenost njunim visokim vrednostim lahko negativno vpliva na zdravje ljudi, zato je zelo pomembno, da znamo natančno napovedati vrednosti teh dveh onesnaževal v zraku. Na ta način lahko sprejmemo preventivne ukrepe, kot je omejeno gibanje na prostem v času povišanih vrednosti $O_3$ ali omejitev prometa v mestih, kar zavira nadaljnje onesnaževanje zraka s $PM_{10}$ delci. V tem delu smo izmerjene vrednosti $PM_{10}$ in $O_3$ predstavili kot časovni vrsti. Iskali smo najboljša modela za napovedovanje njunih dnevnih vrednosti en dan vnaprej. Preizkusili smo model z ekstremnim gradientnim spodbujanjem (XGBoost) in nevronsko mrežo z dolgim kratkoročnim spominom (LSTM) ter njuno učinkovitost primerjali z integriranim avtoregresijskim modelom z drsečimi sredinami (ARIMA). XGBoost je ansambel algoritmov za strojno učenje, ki temelji na odločitvenih drevesih, LSTM pa vrsta nevronske mreže, ki ima sposobnost učenja dolgoročnih odvisnosti. Vsakemu modelu smo poiskali optimalne parametre ter tehnike obdelave časovne vrste, pri modelu LSTM pa smo preizkusili tri različne arhitekture. Sprva smo modela zgradili tako, kot da napovedujeta ob koncu dneva, torej ob polnoči. V realnosti se napovedi opravljajo v poznih jutranjih urah, saj so do takrat na voljo meteorološki podatki jutra za tekoči dan, napovedane vrednosti pa so koristne v času, ko se začnejo ljudje v večjem številu gibati na prostem. Ta pristop smo nato ubrali tudi mi in izkazalo se je, da tako dobimo boljše rezultate. Napovedi smo izboljšali z značilkami, izpeljanimi iz meteoroloških napovedi za tekoči dan, iz meteoroloških podatkov za prejšnji dan in podatkov za tekoči dan do časa napovedi. 
Izkazalo se je, da k boljšim napovedim vrste $O_3$ največ prispevajo meteorološke napovedi, predvsem sončnega sevanja, oblačnosti in padavin, medtem ko so pri napovedovanju vrste $PM_{10}$ najpomembnejše meteorološke meritve za tekoči dan do trenutka napovedi, predvsem podatki o temperaturni inverziji in temperaturi ozračja. Poskusi so pokazali, da se pomembnosti značilk spremenijo, če vrst pred napovedovanjem ne obdelamo, ter da se pomembnosti značilk razlikujejo po posameznih letnih časih. Rezultati poskusov so pokazali, da dobimo najboljše rezultate z uporabo nevronske mreže LSTM, pri čemer uporabimo dvosmerno arhitekturo. Rezultate smo izboljšali, če smo obravnavanima vrstama odstranili letno ter tedensko sezonsko komponento in ju normalizirali. S primerjavo dveh različnih pristopov napovedovanja smo pokazali, da dobimo boljše rezultate, če za napovedi uporabimo daljšo učno množico.
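
The preprocessing step named in the abstract (removing the annual and weekly seasonal components, then normalizing) can be illustrated with a minimal sketch. The abstract does not say how the components were estimated, so everything here is an assumption made for illustration: the annual component is approximated by per-month means, the weekly component by per-weekday means, normalization is a z-score, and the function names `remove_component` and `preprocess` and the synthetic data are hypothetical.

```python
import random
from collections import defaultdict
from datetime import date, timedelta
from statistics import mean, pstdev


def remove_component(dates, values, key):
    """Subtract the mean of each group (e.g. month or weekday) from its members."""
    groups = defaultdict(list)
    for d, v in zip(dates, values):
        groups[key(d)].append(v)
    group_mean = {k: mean(vs) for k, vs in groups.items()}
    return [v - group_mean[key(d)] for d, v in zip(dates, values)]


def preprocess(dates, values):
    """Remove annual and weekly seasonal components, then z-score normalize."""
    # Assumption: the annual component is approximated by calendar-month means.
    no_annual = remove_component(dates, values, key=lambda d: d.month)
    # Assumption: the weekly component is approximated by day-of-week means.
    no_weekly = remove_component(dates, no_annual, key=lambda d: d.weekday())
    mu, sigma = mean(no_weekly), pstdev(no_weekly)
    if sigma == 0:
        return [0.0 for _ in no_weekly]
    return [(v - mu) / sigma for v in no_weekly]


# Synthetic daily series for illustration only (not data from the thesis):
# a winter offset, a weekday offset, and seeded Gaussian noise.
rng = random.Random(0)
start = date(2019, 1, 1)
dates = [start + timedelta(days=i) for i in range(730)]
values = [
    30.0
    + 10.0 * (d.month in (12, 1, 2))
    + 3.0 * (d.weekday() < 5)
    + rng.gauss(0.0, 2.0)
    for d in dates
]
normalized = preprocess(dates, values)
```

In a real forecasting setup the group means and the normalization constants would be estimated on the training set only and then reused on later data, so that no information leaks from the period being forecast.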

Keywords

napovedovanje onesnaženosti zraka;napovedovanje vrednosti časovnih vrst;nevronske mreže z dolgim kratkoročnim spominom;ekstremno gradientno spodbujanje;ozon;

Data

Language:	Slovenian
Year of publishing:	2021
Typology:	2.09 - Master's Thesis
Organization:	UL FRI - Faculty of Computer and Information Science
Publisher:	[J. Katanec]
UDC:	004.42
COBISS:	76527363

Other data

Secondary language:	English
Secondary title:	Forecasting air pollution levels with machine learning techniques
Secondary abstract:	Air pollution is currently one of the major problems in the world. In recent years, elevated levels of two pollutants, particulate matter $PM_{10}$ and ozone ($O_3$), have been recorded in Slovenia. Prolonged exposure to high levels of either can harm human health, so it is very important to be able to forecast their levels in the air accurately. This allows preventive measures to be taken, such as limiting outdoor activity when $O_3$ levels are elevated or restricting urban traffic, which curbs further air pollution with $PM_{10}$ particles. In this thesis, the measured values of $PM_{10}$ and $O_3$ are represented as time series. We searched for the best models for forecasting their daily values one day ahead. We tested an extreme gradient boosting model (XGBoost) and a long short-term memory neural network (LSTM) and compared their performance with the autoregressive integrated moving average model (ARIMA). XGBoost is an ensemble of machine learning algorithms based on decision trees, and LSTM is a type of neural network that can learn long-term dependencies. For each model we searched for optimal parameters and time series preprocessing techniques, and for the LSTM model we tested three different architectures. At first, we built the models as if they were forecasting at the end of the day, at midnight. In practice, forecasts are made in the late morning, because by then the morning's meteorological data for the current day are available and the forecasted values are useful at the time when larger numbers of people start moving outdoors. We then adopted this approach as well, and it turned out to give better results. We improved the forecasts with features derived from meteorological forecasts for the current day, meteorological data for the previous day, and data for the current day up to the time of the forecast.
It turned out that meteorological forecasts, especially of solar radiation, cloud cover, and precipitation, contribute most to better $O_3$ forecasts, while for $PM_{10}$ the most important inputs are meteorological measurements for the current day up to the time of the forecast, especially data on temperature inversion and air temperature. Experiments showed that feature importances change if the time series are not preprocessed before forecasting, and that they differ between seasons. The best results were obtained with an LSTM neural network using a bidirectional architecture. Results improved when we removed the annual and weekly seasonal components from both time series and normalized them. By comparing two different forecasting approaches, we showed that better results are obtained when a longer training set is used for the forecasts.
Secondary keywords:	air pollution prediction;time series forecasting;long short-term memory neural networks;extreme gradient boosting;ozone;
Type (COBISS):	Master's thesis/paper
Thesis comment:	Univ. of Ljubljana, Faculty of Mathematics and Physics, Department of Mathematics, Computer Science and Mathematics - 2nd cycle
Pages:	XI, 81 pp.
ID:	13484662