SLIDE: Scalable Information Discovery and Data Ethics

Defended PhDs

Large scale specialized joins, Rafik Saad
Defense:
Funding: AGIR program
Directors: Sihem Amer-Yahia
co-supervisor: Vincent Leroy

abstract

The problem of implementing join operations on distributed systems such as Hadoop MapReduce has been extensively studied. Okcan and Riedewald have shown that it is impossible to find an efficient pruning strategy for the general theta-join problem. Hence, all pairs of tuples must be evaluated. The goal of this thesis is to explore the space between the simple equi-join and the general theta-join, and identify interesting join primitives that can be efficiently implemented. In particular, we focus on hierarchical datasets such as ontologies.

Semantics and Algorithms for Temporal Big Joins, Julien Pilourdault
Defense: September 28, 2017
Funding: French Ministry of Higher Education and Research (MESR)
Directors: Sihem Amer-Yahia
co-supervisor: Vincent Leroy

abstract

Joins are essential operations at the core of data processing. A join is usually used to connect different data items with a predicate. My thesis will investigate different families of temporal join predicates. Indeed, temporal data, and therefore temporal predicates, are ubiquitous in big data applications such as data center monitoring, analysis of user data in the context of the social web or self quantification using sensors. Therefore, we will propose temporal semantics in relation with these applications.
Since the applications we are interested in manipulate very large data volumes, we propose to associate a scoring function to join operations and return only the most relevant results. While join processing has been extensively studied, the ability to execute big data joins that return ranked results has received very little attention as of today. The goal of this PhD thesis is to define a formal semantics for temporal joins with scoring functions, and to design distributed join algorithms on top of large-scale processing frameworks (i.e. MapReduce, Spark …) for a selected set of temporal predicates.

jury :
– Nadia Brauner, Professor at Grenoble Alpes University, Examiner
– Patrick Valduriez, Senior Researcher at INRIA, Reviewer
– Donald Kossmann, Professor at ETH Zurich, Reviewer
– Yanlei Diao, Professor at Ecole Polytechnique, Examiner
– Sihem Amer-Yahia, Senior Researcher at CNRS, Director
– Vincent Leroy, Professor at Grenoble Alpes University, Co-Advisor

Temporal analysis of social content,Sofia Kleisarchaki
Defense: November 2016
Funding: French Ministry of Higher Education and Research (MESR)
Directors: Sihem Amer-Yahia, Vassilis Christophides (Univ. of Crete) and Ahlame Douzal (LIG, AMA team)

abstract

Data from the social Web are characterized by an ephemeral nature, where interactions/associations between people, topics of conversations and real-world venue activities are evolving over time. In the context of this Ph.D. thesis, we focus on the temporal analysis of social content in order to understand how topical and geographical properties affect the evolution of users’ group discussions over time. Temporal analysis requires the design of new data structures and methods for scalable and dynamic time segmentation of input data.

Mining and ranking closed itemsets from large-scale transactional datasets, Martin Kirchgessner
Defense: September 26, 2016
Funding: Investissements d’avenir Datalyse
Directors: Sihem Amer-Yahia
co-supervisor: Vincent Leroy

abstract

The recent increase of data volumes raises new challenges for itemset mining algorithms. In this thesis, we focus on transactional datasets (collections of items sets, for example supermarket tickets) containing at least a million transactions over hundreds of thousands items. These datasets usually follow a “long tail” distribution: a few items are very frequent, and most items appear rarely. Such distributions are often truncated by existing itemset mining algorithms, whose results concern only a very small portion of the available items (the most frequents, usually). Thus, existing methods fail to concisely provide relevant insights on large datasets. We therefore introduce a new semantics which is more intuitive for the analyst: browsing associations per item, for any item, and less than a hundred associations at once.
To address the items’ coverage challenge, our first contribution is the item-centric mining problem. It consists in computing, for each item in the dataset, the k most frequent closed itemsets containing this item. We present an algorithm to solve it, TopPI. We show that TopPI computes efficiently interesting results over our datasets, outperforming simpler solutions or emulations based on existing algorithms, both in terms of run-time and result completeness. We also show and empirically validate how TopPI can be parallelized, on multi-core machines and on Hadoop clusters, in order to speed-up computation on large scale datasets.
Our second contribution is CAPA, a framework allowing us to study which existing measures of association rules’ quality are relevant to rank results. This concerns results obtained from TopPI or from jLCM, our implementation of a state-of-the-art frequent closed itemsets mining algorithm (LCM). Our quantitative study shows that the 39 quality measures we compare can be grouped into 5 families, based on the similarity of the rankings they produce. We also involve marketing experts in a qualitative study, in order to discover which of the 5 families we propose highlights the most interesting associations for their domain.
Our close collaboration with Intermarché, one of our industrial partners in the Datalyse project, allows us to show extensive experiments on real, nation-wide supermarket data. We present a complete analytics workflow addressing this use case. We also experiment on Web data. Our contributions can be relevant in various other fields, thanks to the genericity of transactional datasets.
Altogether our contributions allow analysts to discover associations of interest in modern datasets. We pave the way for a more reactive discovery of items’ associations in large-scale datasets, whether on highly dynamic data or for interactive exploration systems.

jury :
– Laure Berti-Equille, Senior Scientist, Qatar Computing Research Institute, Reviewer
– Florent Masseglia, Research Scientist, INRIA, Reviewer
– Céline Robardet, Professor, INSA Lyon, Examiner
– Noël De Palma, Professor, Grenoble Alpes University, Examiner
– Renaud Lachaize, Associate Professor, Grenoble Alpes University, Examiner
– Alexandre Termier, Professor, rennes 1 University, Examiner
– Sihem Amer-Yahia, Research Director, CNRS, Director
– Vincent Leroy, Associate Professor, Grenoble Alpes University, Co-Advisor

Une Approche de Fouille de Données pour le Débogage Temporel des Applications Embarquées de Streaming, Oleg Iegorov
Defense: April 8, 2016
Funding:
Directors: Jean-François MEHAUT, Miguel SANTANA (STMicroelectronics)
co-supervisor: Alexandre TERMIER (IRISA), Vincent LEROY

abstract

Debugging streaming applications run on multimedia embedded systems found in modern consumer electronics (e.g. in set-top boxes, smartphones, etc) is one of the most challenging areas of embedded software development. With each generation of hardware, more powerful and complex Systems-on-Chip (SoC) are released, and developers constantly strive to adapt their applications to these new platforms. Embedded software must not only return correct results but also deliver these results on time in order to respect the Quality-of-Service (QoS) properties of the entire system. The non-respect of QoS properties lead to the appearance of temporal bugs which manifest themselves in multimedia embedded systems as, for example, glitches in the video or cracks in the sound. Temporal debugging proves to be tricky as temporal bugs are not related to the functional correctness of the code, thus making traditional GDB-like debuggers essentially useless. Violations of QoS properties can stem from complex interactions between a particular application and the system or other applications; the complete execution context must be, therefore, taken into account in order to perform temporal debugging. Recent advances in tracing technology allow software developers to capture a trace of the system’s execution and to analyze it afterwards to understand which particular system activity is responsible for the violations of QoS properties. However, such traces have a large volume, and understanding them requires data analysis skills that are currently out of the scope of the developers’ education.

In this thesis, we propose SATM (Streaming Application Trace Miner) – a novel temporal debugging approach for embedded streaming applications. SATM is based on the premise that such applications are designed under the dataflow model of computation, i.e. as a directed graph where data flows between computational units called actors. In such setting, actors must be scheduled in a periodic way in order to meet QoS properties expressed as real-time constraints, e.g. displaying 30 video frames per second. We show that an actor which does not eventually respect its period at runtime causes the violation of the application’s real-time constraints. In practice, SATM is a data analysis workflow combining statistical measures and data mining algorithms. It provides an automatic solution to the problem of temporal debugging of streaming applications. Given an execution trace of a streaming application exhibiting low QoS as well as a list of its actors, SATM firstly determines exact actors’ invocations found in the trace. It then discovers the actors’ periods, as well as parts of the trace in which the periods are not respected. Those parts are further analyzed to extract patterns of system activity that differentiate them from other parts of the trace. Such patterns can give strong hints on the origin of the problem and are returned to the developer. More specifically, we represent those patterns as minimal contrast sequences and investigate various solutions to mine such sequences from execution trace data.

Finally, we demonstrate SATM’s ability to detect both an artificial perturbation injected in an open source multimedia framework, as well as temporal bugs from two industrial use cases coming from STMicroelectronics. We also provide an extensive analysis of sequential pattern mining algorithms applied on execution trace data and explain why state-of-the-art algorithms fail to efficiently mine sequential patterns from real-world traces.

jury :
– Jean-François MEHAUT, Professor, grenoble Alpes Universty, Director
– Miguel SANTANA, STMicroelectronics, Co-director
– Alexandre TERMIER, Professor, rennes 1 University, Co-Advisor
– Vincent LEROY, Associate Professor, Grenoble Alpes University, Co-Advisor
– Sebastian FISCHMEISTER, Professor, Waterloo University, Reviewer
– Maguelonne TEISSEIRE, Professor, (IRSTEA), Reviewer
– Marc PLANTEVIT, Professor, Lyon 1 University, Examiner
– Raymond NAMYST, Professor, Bordeaux University, Examiner

Optimization-based User Group Management: Discovery, Analysis, Recommendation, Behrooz Omidvar-Tehrani
Defense: November 6, 2015
Funding: French Ministry of Higher Education and Research (MESR)
Directors: Sihem Amer-Yahia
co-supervisor: Alexandre Termier (IRISA, Univ. Rennes 1)

abstract

User data is becoming increasingly available in multiple domains ranging from phone usage traces to data on the social Web. User data is a special type of data that is described by user demographics (e.g., age, gender, occupation, etc.) and user activities (e.g., rating, voting, watching a movie, etc.) The analysis of user data is appealing to scientists who work on population studies, online marketing, recommendations, and large-scale data analytics. However, analysis tools for user data is still lacking.

In this thesis, we believe there exists a unique opportunity to analyze user data in the form of user groups. This is in contrast with individual user analysis and also statistical analysis on the whole population. A group is defined as a set of users whose members have either common demographics or common activities. Group-level analysis reduces the amount of sparsity and noise in data and leads to new insights. In this thesis, we propose a user group management framework consisting of following components: user group discovery, analysis and recommendation.

The very first step in our framework is group discovery, i.e., given raw user data, obtain user groups by optimizing one or more quality dimensions. The second component (i.e., analysis) is necessary to tackle the problem of information overload: the output of a user group discovery step often contains millions of user groups. It is a tedious task for an analyst to skim over all produced groups. Thus we need analysis tools to provide valuable insights in this huge space of user groups. The final question in the framework is how to use the found groups. In this thesis, we investigate one of these applications, i.e., user group recommendation, by considering affinities between group members.

All our contributions of the proposed framework are evaluated using an extensive set of experiments both for quality and performance.

jury :
– M. Divesh Srivastava – Research Director, AT&T Laboratory, Reviewer
– M. Bruno Crémilleux – Professor, Caen Normandie University, Reviewer
– Mme. Ahlame Douzal – Associate Professor, Grenoble Alpes University, Examiner
– M. David Gross Amblard – Professor, Rennes 1 University, Examiner
– Mme. Elisa Fromont – Associate Professor, Jean Monnet University, Examiner
– Mme. Sihem Amer-Yahia, Research Director, CNRS, Director
– M. Alexandre Termier, Professor, Rennes 1 University, Co-advisor

Exploiting context for a structuration of execution traces in abstraction layers
, Léon Fopa
Defense: June 23, 2015
Funding: FUI SoCTrace (Jan 12 – Jan 15)
Directors: Jean-François Mehaut (LIG, Nanosim team)
co-supervisor: Fabrice Jouanot and Alexandre Termier (IRISA, Univ. Rennes 1)

abstract

Applications analysis and debugging techniques are increasingly challenging task in modern systems. Especially in systems based on embedded multiprocessor components (or MPSoC) that make up the majority of our daily devices today. The use of execution traces is unavoidable to apply a detailed analysis of such systems and identify unexpected behaviors. Even if the trace offers a rich corpus of information to the developer for her work, information relevant to the analysis are hidden in the trace and is unusable without a high level of expertise. Tools dedicated to trace analysis become necessary. However existing tools take little or no account of the specific business aspects to an application or the developer’s business knowledge to optimize the analysis task.
In this thesis, we propose an approach that allows the developer to represent, manipulate and query an execution trace based on concepts related to her own business knowledge. Our approach is the use of an ontology to model and query business concepts in a trace, and the use of an inference engine to reason about these business concepts. Specifically, we propose VIDECOM, the domain ontology for the analysis of execution traces of multimedia applications embedded on MPSoC. We then focus on scaling the operation of this ontology for the analysis of huge traces. Thus, we make a comparative study of different ontologies management systems (or triplestores) to determine the most appropriate architecture for very large traces in our VIDECOM ontology. We also propose an inference engine that addresses the challenges of reasoning about business concepts, namely the inference of the temporal order between business concepts in the trace and the termination of the process of generating!
new knowledge from business knowledge. Finally, we illustrate the practical use of VIDECOM in the SoC-Trace project for the analysis of real execution traces on MPSoC.

Jury:
– M. François Goasdoué, Professor, University of Rennes, Reviewer,
– M. Pierre Boulet, Professor, University of Lille, Reviewer,
– Mme. Nadine Cullot, Professor, University of Bourgogne, Examiner,
– Mme. Marie-Odile Cordier, Professor, Université de Rennes, Examiner,
– M. Jean-françois Méhaut, Professor, University of Grenoble, Director,
– M. Maurice Tchuenté, Professor, University of Yaoundé 1, co-Director,
– M. Fabrice Jouanot, Associate Professor, University of Grenoble, advisor,
– M. Alexandre Termier, Professor, University of Rennes, co-advisor.

Uncertainty-sentitive Reasoning over the Web of Data , Mustafa Al-Bakri
Defense: December 15, 2014
Funding: French Ministry of Higher Education and Research (MESR)
Directors: Marie-Christine Rousset
co-supervisor: Manuel Atencia (LIG/Inria, EXMO team)

abstract

In this thesis we investigate several approaches that help users to find useful and trustful informationin the Web of Data using the Semantic Web technologies. In this purpose, we tackle tworesearch issues: Data Linkage in Linked Data and Trust in Semantic P2P Networks.We model the problem of data linkage in Linked Data as a reasoning problem on possibly
decentralized data. We describe a novel Import-by-Query algorithm that alternates steps of subquery rewriting and of tailored querying the Linked Data cloud in order to import data as specific as possible for inferring or contradicting given target same-as facts. Experiments conducted on real-world datasets have demonstrated the feasibility of this approach and its usefulness in practice for data linkage and disambiguation. Furthermore, we propose an adaptation of this approach to take into account possibly uncertain data and knowledge, with a result the inference of same-as and different-from links associated with probabilistic weights. In this adaptation we model uncertainty as probability values. Our experiments have shown that our adapted approach scales to large data sets and produces meaningful probabilistic weights.

Concerning trust, we introduce a trust mechanism for guiding the query-answering process in
Semantic P2P Networks. Peers in Semantic P2P Networks organize their information using separate ontologies and rely on alignments between their ontologies for translating queries. Trust is such a setting is subjective and estimates the probability that a peer will provide satisfactory answers for specific queries in future interactions. In order to compute trust, the mechanism exploits the information provided by alignments, along with the one that comes from peer’s experiences. The calculated trust values are refined over time using Bayesian inference as more queries are sent and answers received. For the evaluation of our mechanism, we have built a semantic P2P bookmarking system (TrustMe) in which we can vary different quantitative and qualitative parameters. The experimental results show the convergence of trust, and highlight the gain in the quality of peers’ answers —measured with precision and recall— when the process of query answering is guided by our trust mechanism.

jury :
– M. Andrea Tettamanzi – Professor, Université Nice Sophia Antipolis, Reviewer
– M. Mohand-Said Hacid – Professor, Université Claude Bernard Lyon 1, Reviewer
– Mme. Marie-Laure Mugnier – Professor, Université de Montpellier 2, Examiner
– M. Jérôme Euzenat – Reseach Director, INRIA Rhône-Alpes, Examiner
– Mme. Marie Christine ROUSSET – Professor, Université Joseph Fourier (Grenoble 1), Director
– M. Manuel ATENCIA – Associate Professor, Université Pierre Mendès-France University (Grenoble 2), Co-advisor

Abstraction and comparison of execution traces for analysis of embedded multimedia applications, Christiane Kamdem
Defense: December 5, 2014
Funding: FUI SoCTrace (Jan 12 – Jan 15)
Directors: Marie-Christine Rousset, Maurice Tchuente (univ. Yaoundé 1)
co-supervisor: Noha Ibrahim

abstract

Nowadays, due to the increasing complexity in both the applications and the underlying hardware, it is difficult to understand what happens during the execution of these applications. Tracing techniques are commonly used to gather and provide information on application execution in the form of execution traces. The execution traces, which are sequences of events, can be very large (easily millions of events), hard to understand and thus require specific analysis tools. One critical case is the analysis of applications on embedded systems such as set-top boxes or smartphones, especially for understanding bugs of multimedia applications. In this thesis we propose two novel analysis techniques adapted to multimedia applications on embedded systems. The first method reduces size of trace given to the analysts. This method needs to group sets of related events together. We propose an approach based on optimization and pattern mining techniques to automatically extract a set of subsequences from an execution trace. Our experiments showed that the method scales on large amounts of data and at the same time, highlighted the practical interest of this approach. Our second contribution consists in proposing a diagnosis method based on the comparison of execution traces with reference traces. This approach is implemented in TED, our TracE Diagnosis tool. Experiments conducted on real-life use cases of multimedia application execution traces have validated that TED is scalable and brings added value to traces analysis. We also show that the tool can be applied on reduced size traces in order to further improve scalability.
Jury:
– M. Laks V.S. LAKSHMANAN, Professor, University of British Columbia, Reviewer,
– M. Pascal PONCELET, Professor, University of Montpellier 2, Reviewer,
– M. Alexandre TERMIER, Professor, University of Rennes 1, Examiner,
– Mme Céline ROBARDED, Professor, INSA Lyon, University of Lyon, Examiner,
– M. Éric GAUSSIER, Professor, University of Grenoble-Alpes , Examiner,
– Mme Marie-Christine ROUSSET, Professor, University of Grenoble-Alpes, Director,
– M. Maurice TCHUENTE, Professor, University of Yaounde I, Co-advisor,
– Mme Noha IBRAHIM, Associate Professor, Université de Grenoble, Co-advisor.

Privacy preserving data publishing on Linked Open Data,Privacy preserving data publishing on Linked Open Data, Rémi Delanaux
Defense: December 13, 2019
Funding: Region Auvergne Rhone-Alpes
Directors: Angela Bonifati (LIRIS), Marie-Christine Rousset

abstract

Privacy is a major concern when publishing new datasets in
the context of Linked Open Data (LOD). A new dataset published in the
LOD is indeed exposed to privacy breaches due to the linkage to objects
already present in the other datasets of the LOD. In this work, we focus
on the problem of building safe anonymizations of an RDF graph to
guarantee that linking the anonymized graph with any external RDF
graph will not cause privacy breaches. Given a set of privacy queries as
input, we study the data-independent safety problem and the sequence of
anonymization operations necessary to enforce it. We provide sucient
conditions under which an anonymization instance is safe given a set of
privacy queries. Additionally, we show that our algorithms for RDF data
anonymization are robust in the presence of sameAs links that can be
explicit or inferred by additional knowledge.

Large-scale ontology-based data analytics: application to the SIDES 3.0 training platform in Medicine, Adam Sanchez
Funding: SIDES 3.0 project
Directors: Marie-Christine Rousset
co-supervisor: Fabrice Jouanot

abstract

The goal of the PhD work is to design and implement an OBDA infrastructure to support the storage, semantic enrichment and the on-demand analysis of large amounts of educational data. Ontology-based Data Access (OBDA) is a novel paradigm at the crossroad of Artificial Intelligence and Databases that has recently received an increasing interest because it enables end-users to ask their own (possibly complex) queries using a domain vocabulary they are familiar with. OBDA systems have a three-level architecture, constituted by an ontology, data sources, and mappings between the two. In the standard OBDA approach (called the virtual approach), the mappings are exploited to reformulate users queries into effective queries evaluated over the data sources. However, the state-of-the art rewriting approaches are unable to deal with complex aggregated and counting queries, which are yet at the core of data analytics.

OntoSAMSEI: Interactive ontology modeling for supporting simulation-based training in Medicine,Shadi Baghernezhad-Tabasi
Funding: University of Lyon 1, Samsei project
Directors: Marie-Christine Rousset
co-supervisor: Fabrice Jouanot, Loic Druette

abstract

Medical simulation is now a central thread in the fabric of medical education and as an integrative strategy to bridge theory to practice has been identified as a need in medical education in the future. Due to the recency of simulation-based training in medicine and the scarcity of available documentation and modelisation, current information retrieval and data mining approaches are not effective in understanding the context of simulation-based training content. The overall objectives of our research are to develop an interactive and incremental ontology modeling approach in order to model ill-defined domains such as some sub-domains related to pedagogy. We propose building an ontology for simulation-based medical education domain, called ONTOSAMSEI. The main contribution includes a new tool to automatically generate pre-filled forms guided by the ontology in order to share the acquired knowledge with the domain experts, and collect new information from them to enrich the ontology.