Abstract
Recent developments in signal processing, sensor technologies, communications and High Performance Computing enable the realization of “smart” spaces, that is, physical spaces equipped with technology for collecting, transferring and processing data, with the aim of increasing operational efficiency and improving the quality of the processes performed in them. A growing number of applications benefit from sophisticated Machine Learning (ML) approaches, and part of the Internet of Things (IoT) vision is being realized: systems and states can now be monitored and controlled as required. The central idea lies in recording data beyond those that characterize the physical conditions of the space. Individuals produce signals that are related to their activities and indicative of the process being carried out and that, at the same time, indicate whether that process is progressing in an orderly manner. This Doctoral Thesis studies the two preeminent signals that are indicative of the process and the prevailing conditions: Sound and Image. The case study is the classroom, used for theoretical teaching or laboratory experiments. The aim is to clarify the activities and the interest of the parties involved, and thereby to improve the quality of the work carried out in a smart room. These signals are rich in information but subject to various limitations, and processing and understanding them is a challenge. At the same time, a solid understanding of the conditions and consequences of human activity can support the evaluation, and therefore the strengthening, of processes.
The first part of the Thesis addresses the classification of sound signals that are characteristic of the evolution of the educational process. Initially, an extensive set of sound features (143 in total) is reported, algorithms for extracting their values are implemented, and the sounds are classified with ML algorithms.
Given the large number of audio features, the computational burden of using all of them, and the possible degradation of classification accuracy due to overfitting, a method for prioritizing the features according to their descriptive ability is proposed. The method is based on Principal Component Analysis (PCA) and is compared with the well-known Relief-F method. Experimental tests are performed with five ML algorithms using an increasing number of features. The results demonstrate the utility of the dimensionality reduction method, which achieves a classification accuracy of more than 90% using 25 audio features. Then, Deep Learning (DL) mechanisms are employed, specifically Convolutional Neural Networks (CNNs). To improve the classification accuracy, well-established architectures pre-trained on the large ImageNet image dataset are used. These CNNs are applied through transfer learning, after the audio signal has been converted into a suitable visual representation. At this point, the hyper-parameter values used for retraining the networks on new datasets are extensively explored. The outcome is a tuning of the hyper-parameter values that maximizes the classification accuracy while minimizing the corresponding computational time, with a classification accuracy exceeding 90% on three different databases. The research field of sound classification has developed rapidly in recent decades, resulting in the creation of sound datasets that include different, arbitrarily selected sound classes for different case studies. In this regard, two types of systematic association between sound classes were explored: a) semantic and b) comparative, based on sound features.
For the first association, audio classes are related semantically using the unifying AudioSet ontology, in which audio classes are associated according to the origin (source) of the sound. The second association is based on computing the distance between the values of the audio features. At the same time, sounds originating from realistic environments include classes that are combined sequentially and/or with overlap. In these cases the audio stream must be separated (segmented) so that each audio segment can be classified. A set of parameters, such as the minimum sound duration and the sound intervals, was defined to carry out the segmentation of the audio streams.
The second part of the Thesis concerns the study of Image. The development of increasingly advanced algorithms and models has made object detection and recognition trivial. Therefore, this Thesis focuses on a more refined analysis of the image, aimed at facial expression recognition (Facial Emotion Recognition, FER). For this purpose, two image feature extraction mechanisms were investigated: manual, with handcrafted methods, and automatic, through DL methods based on CNNs. Both mechanisms were thoroughly studied in terms of their internal parameters and evaluated on FER databases in terms of classification accuracy and the corresponding computational time. Handcrafted methods were examined with respect to their internal parameter values, while the study of neural networks was two-fold. First, features were extracted from different depths of the networks' layers without retraining the networks on the new data; then, features were extracted after retraining the networks on the new databases through transfer learning.
The research showed that, without retraining the networks, extracting features from the deepest layer of the CNNs leads to lower classification accuracy (74% on average) than handcrafted methods (86% on average). Extracting image features from 50% or 75% of the depth of the CNNs results in higher classification accuracy (90% on average) for each image quality case. Retraining the CNNs through transfer learning improves the classification accuracy when large databases are available. In addition, each method was evaluated for its robustness to two commonly encountered types of noise, Gaussian and Salt & Pepper. CNNs appear to be more robust, as their classification accuracy decreases by 10% versus a 60% decrease for handcrafted methods. The outcome was a framework for selecting a method according to the quality of the available images and the application requirements and specifications. By choosing the appropriate method, a classification accuracy exceeding 92% is achieved for each FER database used.
The heterogeneity of the available signal classification algorithms is reflected in the computational resources required to train the models and extract the results. From this point of view, computational resource requirements can be one of the criteria for choosing the classification method. To this end, a set of training time values was generated for different neural network architectures, training configurations and datasets. Five neural network-based regression models were trained to estimate the training time, and were evaluated on the basis of correlation coefficient and root mean square error. The two-layer neural network yielded the highest correlation coefficient with the smallest error, indicating that this model can provide a good approximation of the computational time required for similar (adjacent) data.
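The two evaluation criteria named above, correlation coefficient and root mean square error, can be computed directly from measured and estimated training times. The following is a minimal sketch in Python, not the implementation used in the Thesis; the function names and the sample values are hypothetical and serve only to illustrate the metrics.

```python
import math

def pearson_r(actual, predicted):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(actual)
    if n != len(predicted) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    # Covariance and variances, left unnormalized because the 1/n factors cancel.
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    if var_a == 0 or var_p == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_a * var_p)

def rmse(actual, predicted):
    """Root mean square error between two equal-length sequences."""
    n = len(actual)
    if n != len(predicted) or n == 0:
        raise ValueError("need two non-empty sequences of equal length")
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)

# Hypothetical training times in seconds (illustrative values, not Thesis data).
measured = [120.0, 340.0, 560.0, 910.0]
estimated = [130.0, 330.0, 580.0, 880.0]
print(pearson_r(measured, estimated), rmse(measured, estimated))
```

A regression model is preferable under these criteria when its correlation coefficient is closer to 1 and its root mean square error is smaller, both measured on data held out from training.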
The available algorithms differ from one another in their architecture, in particular in the number of layers, the complexity of the connections between them, and the number and size of their filters. This heterogeneity is what drives the differing computational resources needed to train the CNN models and to derive inferences from them.
The value of this PhD Thesis is confirmed by further application of the methods and algorithms in open-space research cases. In particular, sound classification with prioritization of sound features was applied to research on environmental noise in an urban landscape, achieving a classification accuracy of up to 85% over eight types of urban noise. High noise classification performance makes it possible to identify the origin of the noise and to take appropriate measures to reduce noise pollution. Image classification, combined with semantic segmentation of image content, was applied to research on forest fire detection. In this way, a risk analysis framework is created based on the infrastructure that exists in the area, as well as on the method of intervention depending on access to it. The detection of objects of interest, combined with classification results that exceed 95%, highlights the value of the proposed research for the early response to fires, as well as the extensibility of the developed algorithms to further fields of application.
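As an illustration of the audio stream segmentation described in the first part, the sketch below shows one simple way in which a minimum sound duration and a minimum interval between sounds could drive segmentation. It is an assumption-laden example, not the method of the Thesis: the energy-threshold detector, the reading of "sound intervals" as a minimum gap, the function name `segment_stream` and all parameter names are hypothetical.

```python
def segment_stream(energies, frame_sec, threshold, min_sound_sec, min_gap_sec):
    """Return (start_sec, end_sec) segments where frame energy exceeds threshold.

    Active runs separated by a gap shorter than min_gap_sec are merged;
    segments shorter than min_sound_sec are then discarded.
    All names and the thresholding rule are illustrative assumptions.
    """
    # 1. Collect runs of consecutive active frames as [start_idx, end_idx) pairs.
    runs = []
    start = None
    for i, energy in enumerate(energies):
        if energy > threshold:
            if start is None:
                start = i
        elif start is not None:
            runs.append([start, i])
            start = None
    if start is not None:
        runs.append([start, len(energies)])

    # 2. Merge runs whose separating gap is shorter than the minimum interval.
    merged = []
    for run in runs:
        if merged and (run[0] - merged[-1][1]) * frame_sec < min_gap_sec:
            merged[-1][1] = run[1]
        else:
            merged.append(run)

    # 3. Keep only segments that last at least the minimum sound duration.
    return [(s * frame_sec, e * frame_sec) for s, e in merged
            if (e - s) * frame_sec >= min_sound_sec]

# Hypothetical per-frame energies, one value per 0.5 s frame.
frames = [0, 0, 5, 5, 0, 5, 5, 5, 0, 0, 0, 0, 5, 0, 0]
print(segment_stream(frames, 0.5, 1, 1.0, 1.0))  # [(1.0, 4.0)]
```

In this example the two bursts separated by a single quiet frame are merged into one segment, while the isolated half-second burst is dropped as too short. Each returned segment could then be passed to a classifier on its own.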