Accessing data in large-scale systems

List of participants: Ch. Collet (Professor), M.-Ch. Rousset (Professor), Ch. Bobineau (Associate Professor), F. Jouanot (Associate Professor), A. Termier (Associate Professor), Benjamen Negrevergne (PhD 2008- )

Query optimization in distributed and dynamic systems

Accessing data involves several dimensions of large-scale systems: number of resources, data volume and data complexity. Current systems that are large-scale in number of resources include grids, peer-to-peer networks, sensor networks, and ambient and ubiquitous environments. The most popular way to access data within these systems conveniently and efficiently is still to use declarative queries that are optimized based on system characteristics. Because these systems are highly dynamic, classical distributed query evaluation techniques are not applicable. Having a global view of the system is not possible: relevant data sources cannot be known a priori, and metadata useful for query evaluation are not always available. In addition, the evaluation strategy for a query has to adapt dynamically to fluctuating conditions and to users with different needs. For example, some may want to maximize performance while others may need to minimize energy consumption. Building on its previous work on adaptive query processing, the HADAS group focuses on new approaches that make query evaluation efficient with respect to the needs of applications running on large-scale systems. We plan to extend our work on efficient query evaluation in two main directions:
Machine-learning-based adaptive query evaluation. In distributed environments where metadata are lacking, classical query evaluation techniques cannot be applied. We propose machine learning techniques that exploit measures taken during previous query executions to improve the performance of future query evaluations (case-based reasoning).
Data and network management in dynamic ad-hoc networks. In distributed environments, queries have to be decomposed into subqueries that are evaluated on different nodes of the network. In dynamic environments, there is no knowledge about data distribution (localization and volumes). We propose to combine network and data management by viewing the whole network as a dynamic distributed database system. This work has been started in collaboration with the LIAMA in China, and promising results have already been obtained.
These two directions will be explored in the setting of the ANR Blanc 2009 project UBIQUEST and in a collaboration with CEA-LETI.
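The case-based reasoning idea in the first direction can be illustrated with a minimal sketch. It is not the group's implementation: the feature vector (number of joins, estimated selectivity), the plan names, and the nearest-neighbour selection rule are all assumptions made for illustration. The sketch records the measured cost of past executions and, for a new query, recommends the plan with the lowest average cost among the most similar past cases.

```python
from dataclasses import dataclass
from math import dist
from typing import Dict, List, Sequence, Tuple


@dataclass
class Case:
    features: Tuple[float, ...]  # hypothetical query/context descriptors
    plan: str                    # hypothetical name of the plan that was run
    cost: float                  # measured cost (time, energy, ...)


class CaseBase:
    """Stores past executions and reuses them to pick a plan."""

    def __init__(self) -> None:
        self.cases: List[Case] = []

    def record(self, features: Sequence[float], plan: str, cost: float) -> None:
        # Keep the measure taken during a finished query execution.
        self.cases.append(Case(tuple(features), plan, cost))

    def recommend(self, features: Sequence[float], k: int = 3,
                  default: str = "default_plan") -> str:
        # Without any past case, fall back to a default plan.
        if not self.cases:
            return default
        # Retrieve the k most similar past cases (Euclidean distance).
        nearest = sorted(self.cases,
                         key=lambda c: dist(c.features, features))[:k]
        # Average the measured cost per plan among those neighbours.
        totals: Dict[str, Tuple[float, int]] = {}
        for case in nearest:
            total, count = totals.get(case.plan, (0.0, 0))
            totals[case.plan] = (total + case.cost, count + 1)
        # Reuse the plan that was cheapest on average.
        return min(totals, key=lambda p: totals[p][0] / totals[p][1])


# Usage: features are (number of joins, estimated selectivity), both assumed.
base = CaseBase()
base.record((2, 0.9), "ship_data_to_query", 4.0)
base.record((2, 0.8), "ship_query_to_data", 1.5)
base.record((8, 0.1), "ship_data_to_query", 2.0)
choice = base.recommend((2, 0.85), k=2)
```

Because the cost is whatever was measured, the same structure could serve users who care about response time and users who care about energy consumption, simply by recording a different measure.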

Mining large amounts of data to extract patterns of interest

Data mining is another way to access large quantities of data, by extracting interesting patterns from them. Such patterns provide meaningful abstractions of the raw data: they are less numerous than the raw data and more appropriate for data analysis. The group works on pattern mining in complex data such as sequences, trees or graphs, which are found in many applications in chemistry (e.g. graphs representing molecules) or in bioinformatics (e.g. gene regulation networks).
The focus will be on designing and deploying parallel pattern mining algorithms on multicore processors. The DAMOCLES project, which is just starting (supported by the MSTIC pole of UJF), will investigate “DAta Mining for On Chip Low Energy systems”. This project involves the HADAS and MESCAL teams of LIG and the machine architecture team of the TIMA laboratory in Grenoble (F. Petrot, SLS team).
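To make the idea of parallel pattern mining concrete, here is a small sketch under simplifying assumptions. It mines frequent item pairs in transactions rather than the sequence, tree or graph patterns mentioned above, and it uses a plain partition-count-merge scheme that is not taken from the project. The function names and the `executor_cls` parameter are hypothetical. Threads are used by default only to keep the sketch portable; passing `concurrent.futures.ProcessPoolExecutor` instead would be the way to actually spread the counting over several cores in CPython.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


def count_pairs(transactions):
    """Count every unordered item pair occurring in a chunk of transactions."""
    counts = Counter()
    for transaction in transactions:
        # Sort and deduplicate so each pair has one canonical form.
        counts.update(combinations(sorted(set(transaction)), 2))
    return counts


def frequent_pairs(transactions, min_support, workers=4,
                   executor_cls=ThreadPoolExecutor):
    """Return the pairs that appear in at least min_support transactions."""
    # Partition the data: worker i gets transactions i, i+workers, ...
    chunks = [transactions[i::workers] for i in range(workers)]
    total = Counter()
    with executor_cls(max_workers=workers) as pool:
        # Each worker counts its own chunk; partial counts are then merged.
        for partial in pool.map(count_pairs, chunks):
            total.update(partial)
    return {pair: n for pair, n in total.items() if n >= min_support}


# Usage on a toy dataset (hypothetical data).
transactions = [["a", "b", "c"], ["a", "b"], ["b", "c"], ["a", "c", "b"]]
result = frequent_pairs(transactions, min_support=3)
```

The merge step is cheap here because partial counts are small; for richer pattern types the cost of sharing intermediate results between cores becomes a central design concern.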