Abstract
Rapidly evolving technologies are constantly expanding the need to analyze and use existing data. Many organizations base their business viability on the analysis of market data and of the data they produce themselves, either by extracting useful statistics and performance indicators or by feeding the data into decision-making processes; in this analysis, time is one of the most important parameters. To store and analyze such huge volumes of data, new methods of data management and analysis are being created. This became especially noticeable with the advent of Big Data. The technologies developed for it made it possible to extend the methods that existed for conventional data and to create new methods, techniques and systems that provide the same or even better analytics. However, with the advent of IoT, the volume of data and the number of data flows are increasing rapidly. These flows must be stored, analyzed and combined with other data to extract useful information. With the advent of ML/AI, more and more processes can be automated to generate new knowledge. One of the main problems, however, is the lack of labeled data. Skyline queries are among the most common queries used to retrieve information from data. They belong to the category of multi-objective optimization problems and aim to retrieve a set of answers that satisfy several, usually conflicting, criteria. Such queries have many areas of application and are particularly helpful in decision-making processes where there are multiple criteria for achieving a goal and the optimal solution may not be unique. So far, most of the work in this field of research is concerned with conventional data, and there is room for research in the field of Big Data.
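To make the notion of a skyline concrete, the following is a minimal sketch of a naive, non-index-based skyline computation. It is an illustration only, not one of the algorithms developed in this Thesis. The function names `dominates` and `skyline`, the sample `hotels` data, and the convention that smaller values are better in every dimension are all assumptions made for this example.

```python
from typing import List, Tuple

Point = Tuple[float, ...]


def dominates(a: Point, b: Point) -> bool:
    """Return True if a dominates b.

    Assumption for this sketch: smaller is better in every dimension.
    a dominates b when a is no worse than b everywhere and strictly
    better in at least one dimension.
    """
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def skyline(points: List[Point]) -> List[Point]:
    """Naive O(n^2) skyline: keep the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical data: (price, distance to the beach), both to be minimized.
hotels = [(50, 8), (80, 2), (60, 5), (90, 3), (55, 9)]
print(skyline(hotels))  # [(50, 8), (80, 2), (60, 5)]
```

In this example (90, 3) is dominated by (80, 2) and (55, 9) is dominated by (50, 8), so neither belongs to the skyline. The remaining three points trade price against distance, which is why the answer to such a query is a set rather than a single optimum.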
Taking all of the above into account, this Thesis carries out an extensive review of the field of skyline queries, identifies the data specifications and needs of an information system for maritime environments, analyzes the time parameter in skyline queries, develops skyline queries on tree structures specifically designed for Big Data, and implements a classifier specifically designed for Big Data environments. More specifically, the first contribution is an extensive review of the existing work on skyline queries. It presents the skyline family, with its many variations on the initial skyline query algorithm, the difference between index-based and non-index-based methods, and the applications of skyline queries to problem solving. This review shows how skyline queries have evolved and allows readers to find areas that can be explored further. The second contribution explores the various aspects of data in the context of a maritime information system. This analysis reviews the existing research area and the data needed to implement a maritime information system, as well as the limitations that exist in processing and distributing the data. Through this research, the relevance of Big Data became apparent, large data sets available for analysis were identified, and it became clear that time parameterization is very important for performing data analytics. The third contribution studies how the dimension of time can be integrated into skyline queries. The time dimension is an important parameter in data analysis and query processing that is in many cases overlooked. This research reveals that the time parameter can affect the skyline, which shows that the time dimension requires a dedicated analysis and that skyline queries must be properly modified in order to integrate it.
The fourth contribution examines the application of skyline queries in the field of Big Data, and specifically in SpatialHadoop. SpatialHadoop is an extension of conventional Hadoop that integrates into Hadoop tree structures known from conventional data. This analysis shows how both types of skyline algorithms, index-based and non-index-based, behave in Big Data environments, and how hybrid combinations work when non-index-based skyline algorithms are applied to the indexed dataset created by SpatialHadoop. Finally, one of the biggest problems in deploying a machine learning model is the lack of labeled data. This lack is even more noticeable in Big Data environments, where the large volume of data makes labeling more difficult. The literature contains many mechanisms for labeling data, depending on the application, but no mechanisms for the efficient labeling of large volumes of data. Thus, in the fifth contribution, a classifier based on skyline queries was created. The use of the skyline allows the creation of decision boundaries consisting of a small number of points.