Abstract
Digital data has become a key resource for solving scientific and business problems and for achieving competitive advantage. To this end, scientific and business communities worldwide are trying to extract knowledge from the data available to them. The timely use of data significantly affects scientific progress, quality of life, and economic activity. In the digital age, efficient processing and effective analysis of data are important challenges.

Processing data in main memory can boost efficiency, especially when combined with new software system architectures. At the same time, useful and usable tools are required for analysing main-memory data, in order to satisfy important use cases not met by database and programming-language technologies.

Unified management of the memory hierarchy can improve the processing of data in main memory. In this architecture, communication between the different parts of the memory hierarchy is transparent to applications, and optimization techniques are applied holistically. Data flows through the memory hierarchy so that items to be processed shortly are closest to the CPUs, and programming languages treat temporary and permanent data of any type uniformly. As a result, new data analysis systems can be developed that exploit main-memory data structures, which are faster than disk-based ones, leaving the memory hierarchy to take care of data availability.

The absence of suitable analytical tools hinders knowledge extraction in the case of software applications that do not need the support of a database system. Examples include applications whose data have a complex structure and are often stored in files, e.g. scientific applications in areas such as biology, and applications that do not maintain permanent data, such as data visualization applications and diagnostic tools.
Databases offer widely used and recognized query interfaces, but applications that do not need the services of a database should not resort to one only to satisfy the need to analyze their data. Programming languages, on the other hand, rarely provide expressive and usable query interfaces. These can be used internally in an application, but they usually do not offer interactive ad-hoc queries at runtime. The data analysis scenarios they can support are therefore fixed, and any additions or modifications to the queries entail recompiling and rerunning the application.

In addition to solving problems modeled by software applications, data analysis techniques are useful for solving problems that occur in the applications themselves. This becomes possible by analyzing the metadata that applications keep in main memory during their operation. The practice can be applied to any kind of system software, such as an operating system.

This thesis studies the methods and technologies for supporting queries on main-memory data, and how the currently widespread architecture of software systems affects those technologies. Based on the findings from the literature, we develop a method and a technology for performing interactive queries on data that reside in main memory. Our approach is driven by the criteria of usefulness and usability. After an overview of the programming languages that fit data analysis, we choose SQL, the standard data manipulation language for decades.

The method we develop represents programming data structures in relational terms, as SQL requires. Our method replaces the associations between structures with relationships between their relational representations. The result is a virtual relational schema of the programming data model, which we call a relational representation.

The method's implementation targets the C and C++ programming languages because of their wide use for the development of systems and applications.
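To illustrate the idea of a relational representation, the following sketch maps two linked in-memory structures to two tables, turning the association between them into a foreign-key relationship that SQL can join. The sketch is hypothetical: all structure names and data are invented for illustration, it is written in Python rather than C/C++, and it copies the data into an in-memory SQLite database, whereas the thesis's implementation exposes the live program structures through SQLite's virtual table API without copying them.

```python
import sqlite3

# Hypothetical in-memory program data: processes, each holding open files.
processes = [
    {"pid": 1, "name": "init", "files": ["/etc/inittab"]},
    {"pid": 42, "name": "editor", "files": ["/tmp/draft.txt", "/etc/passwd"]},
]

# Relational representation: one table per structure; the process->file
# association becomes a foreign-key relationship between the tables.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE process(pid INTEGER PRIMARY KEY, name TEXT)")
db.execute("CREATE TABLE open_file(pid INTEGER REFERENCES process(pid), path TEXT)")
for p in processes:
    db.execute("INSERT INTO process VALUES (?, ?)", (p["pid"], p["name"]))
    db.executemany("INSERT INTO open_file VALUES (?, ?)",
                   [(p["pid"], f) for f in p["files"]])

# An ad-hoc query over the relational schema: which processes hold /etc files?
rows = db.execute("""
    SELECT DISTINCT process.name
    FROM process JOIN open_file USING (pid)
    WHERE open_file.path LIKE '/etc/%'
    ORDER BY process.name
""").fetchall()
print(rows)  # [('editor',), ('init',)]
```

The join reconstructs the pointer-based association declaratively, which is what makes interactive ad-hoc analysis possible without recompiling the application.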
An additional reason for choosing C++ is the large number of algorithms and data structures it offers. The implementation includes a domain-specific language for describing relational representations, a compiler that, given a relational specification, generates the source code of the relational interface to the programming data structures, and an implementation of SQLite's virtual table API. SQLite is a relational database system that provides the query engine and, through its virtual table API, the ability to run queries on non-relational data.

The implementation extends to the development of two diagnostic tools that identify problems in software systems through queries on main-memory metadata related to their state. C, as the implementation language of many software systems, is ideal for applying this idea. For this purpose we incorporate our implementation in the Linux kernel. Important implementation aspects we address are synchronized access to data and the integrity of query results. We also apply our approach to extend the diagnostic capabilities of Valgrind, a system that checks how software applications use memory.

The overall evaluation of our approach involves its integration in three C++ software applications, in the Linux kernel, and in Valgrind, where we also perform a user study with students. For the study we combine qualitative analysis through a questionnaire and quantitative analysis using code measurements.

In the context of the C++ applications, performance measurements comparing PiCO QL queries with the corresponding queries expressed in C++ show that SQL, combined with our relational representation, provides greater expressiveness. The same holds when we compare our approach with SQL after importing the data into a MySQL relational database system. The efficiency of our approach is worse than C++ and better than MySQL: queries with our approach need about twice as long to run compared with C++, regardless of the problem's size.
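The expressiveness comparison above can be sketched concretely: the same aggregation written as imperative code and as one declarative SQL statement over an in-memory SQLite table. The dataset and names below are invented for illustration and are not measurements from the thesis; the sketch shows the contrast in formulation, not the performance ratios reported above.

```python
import sqlite3

# Hypothetical per-CPU load samples (name, load); illustrative only.
samples = [("cpu0", 10), ("cpu1", 80), ("cpu0", 30), ("cpu1", 20), ("cpu0", 50)]

# Imperative formulation: the aggregation is spelled out step by step.
totals = {}
for cpu, load in samples:
    totals[cpu] = totals.get(cpu, 0) + load
imperative = sorted(totals.items())

# Declarative formulation: the same aggregation as one SQL statement,
# evaluated by SQLite over an in-memory table.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sample(cpu TEXT, load INTEGER)")
db.executemany("INSERT INTO sample VALUES (?, ?)", samples)
declarative = db.execute(
    "SELECT cpu, SUM(load) FROM sample GROUP BY cpu ORDER BY cpu").fetchall()

print(imperative)   # [('cpu0', 90), ('cpu1', 100)]
print(declarative)  # [('cpu0', 90), ('cpu1', 100)]
```

Both formulations compute the same result; the declarative one states what is wanted and leaves the evaluation strategy to the query engine.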
The SQL queries in MySQL require two, three, or more times longer to execute compared to our approach.

In the context of the Linux kernel, where our relational interface functions as a diagnostic tool, we find real problems by executing queries against the kernel's data structures. The security problems we identify are: access to files without the required privileges, unauthorized execution of processes, binaries that take part in loading processes but are used by none, and the direct execution of system calls by processes belonging to a virtual machine. In addition, we show queries that combine metrics from different subsystems, such as pages in memory, disk files, processor activity, and network data transfers, which can help identify performance problems. The measured query processing time and the overhead added to the system encourage the use of our tool.

The diagnostic tool we developed for Valgrind detects problems, additional to those found by Valgrind, through queries on the metadata collected about the application under test. The bzip2 tool, for instance, wastes nine hundred KB where all the memory cells are consecutive in a single pool; this size amounts to twelve percent of the total memory the application needs to operate. Through queries on the dynamic function call graph formed during an application's execution we find a performance-critical code path. It is located in the glibc library and is widely used by the sort and uniq Unix tools. The corresponding optimization was implemented by glibc's development team, without our contribution, and was included in the next version.

Finally, in the user study one group expresses analysis tasks with SQL queries and the other with Python code. The results show that the time required to express an analysis task is smaller when SQL is used.
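The kind of diagnostic query used above can be sketched over a stand-in table. The schema, pool names, and figures below are invented for illustration; in the thesis such metadata lives inside Valgrind and is queried through the relational interface rather than a copied table.

```python
import sqlite3

# Hypothetical memory-pool metadata, as a memory-checking tool might
# collect it: (pool, cell size in bytes, total cells, used cells).
pools = [
    ("pool_a", 64, 1000, 1000),
    ("pool_b", 128, 500, 100),   # mostly unused cells
    ("pool_c", 32, 200, 190),
]

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE pool(
    name TEXT, cell_size INTEGER, cells_total INTEGER, cells_used INTEGER)""")
db.executemany("INSERT INTO pool VALUES (?, ?, ?, ?)", pools)

# Ad-hoc diagnostic query: wasted bytes per pool, worst offenders first.
waste = db.execute("""
    SELECT name, (cells_total - cells_used) * cell_size AS wasted_bytes
    FROM pool
    WHERE cells_used < cells_total
    ORDER BY wasted_bytes DESC
""").fetchall()
print(waste)  # [('pool_b', 51200), ('pool_c', 320)]
```

A query like this turns raw allocator metadata into an actionable ranking, which is the role the relational interface plays inside the diagnostic tools.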
By contrast, no statistically significant differences are observed between the two approaches in terms of usefulness, efficiency, and expressiveness, although our approach receives a higher rating. For the dimension of usability the evaluations show no clear winner, but both approaches achieve very good marks. The evaluation of the SQL group's code shows that this group produced more correct answers in less programming time. We consider this metric indicative of our approach's usefulness vis-à-vis Python, which is also widely used for data analysis. We also consider the time required to express an analysis task a usability factor.

Challenges to the processing of data continue to emerge at an unabated pace. In this environment software applications require solutions for analyzing user data, but also for solving problems related to their own operation. Processing data in main memory can bring important benefits in combination with other innovations, and new architectures that favor efficient data processing can play an important role in this direction. We hope that this thesis will contribute to the efficient processing and effective analysis of data that users expect.