Abstract
The increasing number of technological products that interact with various views of a human’s life are currently characterized by a common factor: the production of vast amounts of data, compared especially to the corresponding quantities of the last decades. This fact, which has been ideally harmonized with the parallel evolvement of the storing systems and the data retrieval mechanisms, has inferred great obstacles during their exploitation as informative sources for training the corresponding learning models, according to the principles that are defined by the fields of Machine Learning and Data Mining. The main reason is the inability to obtain the ground truth of their target value, either in the case of Classification or Regression tasks, without procedures that demand a lot of time or monetization costs, regarding the large volumes of data that are usually available. As a solution to this reality, which is characterized by the existence of numerous unlabeled data along with only ...
The increasing number of technological products that interact with various views of a human’s life are currently characterized by a common factor: the production of vast amounts of data, compared especially to the corresponding quantities of the last decades. This fact, which has been ideally harmonized with the parallel evolvement of the storing systems and the data retrieval mechanisms, has inferred great obstacles during their exploitation as informative sources for training the corresponding learning models, according to the principles that are defined by the fields of Machine Learning and Data Mining. The main reason is the inability to obtain the ground truth of their target value, either in the case of Classification or Regression tasks, without procedures that demand a lot of time or monetization costs, regarding the large volumes of data that are usually available. As a solution to this reality, which is characterized by the existence of numerous unlabeled data along with only a few labeled data, the Partially Supervised Learning algorithms has aroused the last years. According to their generic operation, each learning model is refined using only a small portion of the total amount of collected data, which have to get labeled through an accurate mechanism, and then the pool of unlabeled data is mined for detecting the most informative instances, aiming at improving both the predictive behavior and the robustness of its decisions, without depending much on the human factor, whose contribution is usually time-expensive and costly. Indeed, this kind of algorithms try to uncover the underlying distribution of generating data, so as to obtain better generalization performance over unknown data. Two of the most important categories of Partially Supervised Learning algorithms are these of Semi-supervised and Active Learning. Besides the fact that both of them share some common properties, adopting iterative learning schemes and letting probabilistic learners to be combined appropriately under their operating framework, they substantially differ over one point. Although the former category produces fully automated learning tools, the latter employs human oracles into its learning kernel, so as to make queries over the target value of the most ambiguous unlabeled instances. In practice, the main contribution of the current thesis is to highlight the use of ensemble learners into the Partially Supervised Learning algorithms, the adoption of mechanisms that enable the additional reduction of time and cost expenses, observing their behavior into this kind of algorithms, as well as studying the usefulness of similar approaches into more specific scientific fields that have slightly been examined by other demonstrations in the literature. Moreover, in order to favor the wide applicability of the proposed algorithms, loose assumptions were posed, operating under efficient single-view learning schemes, trying to obtain accurate enough predictions through several approaches of formatting base learners. The suggestion also about combining these two distinct approaches under a common framework of exploiting the available data and computational resources, led to a learning strategy that is proved really beneficial on real-life scenarios, which stem from a harmonic combination of separate learning strategies that highly reduce the version space of candidate hypotheses. To sum up, we consider that the current thesis constitutes one comprehensive work that examines the operation of several Partially Supervised Learning algorithms regarding their application on both generic and more specific problems. The ulterior ambition, besides highlighting the applicability of the proposed works, is to provide thorough comments and to extract useful directions and conclusions that could facilitate the further study of this interesting field. Link of Publications: https://dblp.org/pers/hd/k/Karlos:Stamatis
show more