Abstract
Database management systems (DBMS) were developed to collect, store, organize, and manage data. Data and information are retrieved from databases either through known, clearly formulated questions (queries) or through information discovery with data mining techniques, whose algorithms operate on the data to uncover previously unknown information. In this thesis, a meteorological database is first designed, and target data from it is then used in data mining applications and in research work that follows a modified Knowledge Discovery from Databases (KDD) procedure. The data mining applications, which concern the operational data of the National Hail Suppression Program of the Hellenic Agricultural Insurance Organization, are hail class estimation, maximum hail size prediction, prediction of the seeding parameters of the hail suppression program, and extraction of the observed convective day category index. For the research work, the knowledge discovery process is applied to the meteorological database by appropriately modifying the CRISP-DM model. The goal is to build one or more data mining models that identify the occurrence of precipitation at a point on the ground, using data from a meteorological station of the National Meteorological Service and the whole ERA-40 dataset of the European Centre for Medium-Range Weather Forecasts (ECMWF). Different scenarios and strategies are formulated for selecting or transforming the input to the data mining techniques. They rely mainly on empirical knowledge of the field data and are used to examine issues that may affect the performance of five classification algorithms. More specifically, the effect of the training dataset size on the performance of the algorithms is studied, and the size that ensures the best performance of each algorithm is determined.
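The training-size study described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the synthetic single-feature data, the toy threshold classifier `fit_threshold`, the helper `size_sweep`, and the use of plain accuracy as the performance measure. It does not reproduce the thesis's five algorithms or its data; it only shows the general shape of such an experiment, where one classifier is refit on training subsets of increasing size and scored on a fixed test set.

```python
import random


def fit_threshold(samples):
    """Toy classifier (hypothetical): threshold halfway between class means."""
    wet = [x for x, y in samples if y == 1]
    dry = [x for x, y in samples if y == 0]
    if not wet or not dry:
        # Only one class seen: fall back to predicting that class.
        majority = 1 if len(wet) >= len(dry) else 0
        return lambda x: majority
    threshold = (sum(wet) / len(wet) + sum(dry) / len(dry)) / 2
    return lambda x: 1 if x > threshold else 0


def size_sweep(train, test, sizes, fit, seed=0):
    """Refit on the first n shuffled training samples; return accuracy per n."""
    rng = random.Random(seed)
    shuffled = train[:]
    rng.shuffle(shuffled)
    scores = {}
    for n in sizes:
        model = fit(shuffled[:n])
        correct = sum(model(x) == y for x, y in test)
        scores[n] = correct / len(test)
    return scores


def make_samples(n, rng):
    """Synthetic data: one humidity-like feature, label 1 = precipitation."""
    samples = []
    for _ in range(n):
        label = 1 if rng.random() < 0.3 else 0
        mean = 70.0 if label == 1 else 50.0
        samples.append((rng.gauss(mean, 10.0), label))
    return samples


data_rng = random.Random(42)
train_set = make_samples(1000, data_rng)
test_set = make_samples(300, data_rng)
curve = size_sweep(train_set, test_set, [20, 100, 500], fit_threshold)
for n, accuracy in curve.items():
    print(n, round(accuracy, 3))
```

In a real study, the size at which the score stops improving would be read off such a curve separately for each algorithm.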
Furthermore, a study of two approaches to forming the training datasets shows that the performance of the algorithms does not depend on how the instances are chosen, i.e., whether random instances or all the instances of randomly selected years are used. When forecasting the weather in a region, operational meteorologists usually examine the temporal changes of the meteorological parameters. Across three scenarios for transforming the independent variables (input features), however, the classification algorithms perform better with the parameter values themselves than with their temporal changes. These three scenarios are examined both for the natural distribution of the dependent variable and for a balanced distribution obtained with random undersampling. The distribution of the dependent variable, the precipitation class, raises the class imbalance problem, which is addressed with several methods. More specifically, nine resampling techniques are applied in addition to the natural distribution; they are either drawn from the literature or newly proposed on the basis of meteorological expertise. Additionally, the boosting method AdaBoost.M1 is applied to improve the performance of the classification algorithms. The results show that, compared with the natural distribution, the performance of only one algorithm is unaffected by these techniques. The performance of the remaining four algorithms improves significantly, particularly with the newly proposed technique based on meteorological expertise.
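Random undersampling (the random under-resampling method mentioned above) balances the class distribution by discarding randomly chosen instances of the larger classes. The following is a minimal sketch under stated assumptions: the function name `random_undersample`, the `(label, value)` row layout, and the 90/10 class split are hypothetical and chosen only for illustration. The thesis's other resampling techniques, including the one based on meteorological expertise, are not shown.

```python
import random
from collections import Counter, defaultdict


def random_undersample(rows, label_of, seed=0):
    """Keep a random subset of each class, sized to the smallest class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row in rows:
        by_class[label_of(row)].append(row)
    n_min = min(len(members) for members in by_class.values())
    balanced = []
    for members in by_class.values():
        # Sample without replacement so no instance is duplicated.
        balanced.extend(rng.sample(members, n_min))
    rng.shuffle(balanced)
    return balanced


# Hypothetical imbalanced data: 90 dry instances and 10 wet instances.
rows = [("no_rain", i) for i in range(90)] + [("rain", i) for i in range(10)]
balanced = random_undersample(rows, label_of=lambda row: row[0])
print(Counter(label for label, _ in balanced))
```

Because the majority class is cut down to the minority size, the balanced set is much smaller than the original. This is the usual trade-off of undersampling against methods that keep all instances.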