Defense: September 26, 2016
Funding: Investissements d’avenir Datalyse
Directors: Sihem Amer-Yahia
co-supervisor: Vincent Leroy
The recent increase in data volumes raises new challenges for itemset mining algorithms. In this thesis, we focus on transactional datasets (collections of item sets, for example supermarket tickets) containing at least a million transactions over hundreds of thousands of items. These datasets usually follow a “long tail” distribution: a few items are very frequent, while most items appear rarely. Such distributions are often truncated by existing itemset mining algorithms, whose results concern only a very small portion of the available items (usually the most frequent ones). Existing methods thus fail to concisely provide relevant insights on large datasets. We therefore introduce a new semantics which is more intuitive for the analyst: browsing associations per item, for every item, with fewer than a hundred associations at a time.
To address the challenge of item coverage, our first contribution is the item-centric mining problem. It consists in computing, for each item in the dataset, the k most frequent closed itemsets containing this item. We present TopPI, an algorithm that solves this problem. We show that TopPI efficiently computes interesting results over our datasets, outperforming simpler solutions and emulations based on existing algorithms, both in terms of run time and result completeness. We also show, and empirically validate, how TopPI can be parallelized on multi-core machines and on Hadoop clusters, in order to speed up computation on large-scale datasets.
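To make the item-centric semantics concrete, here is a minimal brute-force sketch in Python (not TopPI's actual algorithm, which relies on dedicated pruning and parallelization): for every item, it keeps the k most frequent closed itemsets containing it. The toy transactions are, of course, hypothetical.

```python
from itertools import combinations
from collections import Counter

def item_centric_topk(transactions, k):
    """For each item, the k most frequent closed itemsets containing it.
    Brute force, for illustration only: enumerates every sub-itemset of
    every transaction, then filters non-closed ones."""
    transactions = [frozenset(t) for t in transactions]
    # Support of every itemset occurring in at least one transaction.
    support = Counter()
    for t in transactions:
        for size in range(1, len(t) + 1):
            for combo in combinations(sorted(t), size):
                support[frozenset(combo)] += 1
    # An itemset is closed iff no strict superset has the same support.
    closed = [s for s in support
              if not any(s < other and support[other] == support[s]
                         for other in support)]
    result = {}
    for item in {i for t in transactions for i in t}:
        containing = [s for s in closed if item in s]
        containing.sort(key=lambda s: (-support[s], sorted(s)))
        result[item] = [(set(s), support[s]) for s in containing[:k]]
    return result
```

Note how the rare item `c` still gets a result, which is precisely the point of the item-centric semantics: a global top-k over the whole dataset would only report associations of the most frequent items.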
Our second contribution is CAPA, a framework allowing us to study which existing measures of association-rule quality are relevant for ranking results. This concerns results obtained from TopPI or from jLCM, our implementation of a state-of-the-art frequent closed itemset mining algorithm, LCM. Our quantitative study shows that the 39 quality measures we compare can be grouped into 5 families, based on the similarity of the rankings they produce. We also involve marketing experts in a qualitative study, in order to discover which of the 5 families highlights the most interesting associations for their domain.
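The quantitative part of such a study can be sketched as follows: score each rule with several quality measures, then compare the rankings the measures induce using a rank correlation. Here lift and confidence stand in for the 39 measures actually compared, and the rule counts are made up for illustration; this is a minimal sketch, not CAPA itself.

```python
def confidence(sup_xy, sup_x):
    """P(Y | X): how often rule X -> Y holds when X occurs."""
    return sup_xy / sup_x

def lift(sup_xy, sup_x, sup_y, n):
    """Observed co-occurrence over what independence would predict."""
    return (sup_xy * n) / (sup_x * sup_y)

def kendall_tau(scores_a, scores_b):
    """Rank correlation between two measures over the same rules (no ties):
    +1 means identical rankings, -1 means fully reversed rankings."""
    pairs = [(i, j) for i in range(len(scores_a))
             for j in range(i + 1, len(scores_a))]
    concordant = sum(1 for i, j in pairs
                     if (scores_a[i] - scores_a[j]) * (scores_b[i] - scores_b[j]) > 0)
    discordant = sum(1 for i, j in pairs
                     if (scores_a[i] - scores_a[j]) * (scores_b[i] - scores_b[j]) < 0)
    return (concordant - discordant) / len(pairs)

# Hypothetical rules as (sup_xy, sup_x, sup_y) counts over n transactions.
n = 1000
rules = [(80, 100, 200), (50, 400, 100), (30, 50, 900)]
conf_scores = [confidence(xy, x) for xy, x, y in rules]
lift_scores = [lift(xy, x, y, n) for xy, x, y in rules]
tau = kendall_tau(conf_scores, lift_scores)
```

Measures that systematically produce highly correlated rankings end up in the same family; clustering the pairwise correlation matrix of all measures is what yields the groups.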
Our close collaboration with Intermarché, one of our industrial partners in the Datalyse project, allows us to run extensive experiments on real, nation-wide supermarket data. We present a complete analytics workflow addressing this use case. We also experiment on Web data. Thanks to the genericity of transactional datasets, our contributions are relevant to various other fields.
Altogether, our contributions allow analysts to discover associations of interest in modern datasets. We pave the way for a more reactive discovery of item associations in large-scale datasets, whether on highly dynamic data or in interactive exploration systems.
– Laure Berti-Equille, Senior Scientist, Qatar Computing Research Institute, Reviewer
– Florent Masseglia, Research Scientist, INRIA, Reviewer
– Céline Robardet, Professor, INSA Lyon, Examiner
– Noël De Palma, Professor, Grenoble Alpes University, Examiner
– Renaud Lachaize, Associate Professor, Grenoble Alpes University, Examiner
– Alexandre Termier, Professor, Rennes 1 University, Examiner
– Sihem Amer-Yahia, Research Director, CNRS, Director
– Vincent Leroy, Associate Professor, Grenoble Alpes University, Co-Advisor
Defense: April 8, 2016
Directors: Jean-François MEHAUT, Miguel SANTANA (STMicroelectronics)
co-supervisor: Alexandre TERMIER (IRISA), Vincent LEROY
Debugging streaming applications run on the multimedia embedded systems found in modern consumer electronics (e.g. set-top boxes, smartphones, etc.) is one of the most challenging areas of embedded software development. With each generation of hardware, more powerful and complex Systems-on-Chip (SoC) are released, and developers constantly strive to adapt their applications to these new platforms. Embedded software must not only return correct results but also deliver them on time, in order to respect the Quality-of-Service (QoS) properties of the entire system. Violating QoS properties leads to temporal bugs, which manifest themselves in multimedia embedded systems as, for example, glitches in the video or cracks in the sound. Temporal debugging proves to be tricky: temporal bugs are not related to the functional correctness of the code, which makes traditional GDB-like debuggers essentially useless. Violations of QoS properties can stem from complex interactions between a particular application and the system or other applications; the complete execution context must therefore be taken into account in order to perform temporal debugging. Recent advances in tracing technology allow software developers to capture a trace of the system’s execution and to analyze it afterwards to understand which particular system activity is responsible for the violations of QoS properties. However, such traces are large, and understanding them requires data analysis skills that are currently outside the scope of developers’ education.
In this thesis, we propose SATM (Streaming Application Trace Miner), a novel temporal debugging approach for embedded streaming applications. SATM is based on the premise that such applications are designed under the dataflow model of computation, i.e. as a directed graph where data flows between computational units called actors. In such a setting, actors must be scheduled periodically in order to meet QoS properties expressed as real-time constraints, e.g. displaying 30 video frames per second. We show that an actor that fails to respect its period at runtime causes the violation of the application’s real-time constraints. In practice, SATM is a data analysis workflow combining statistical measures and data mining algorithms, providing an automatic solution to the problem of temporal debugging of streaming applications. Given an execution trace of a streaming application exhibiting low QoS, as well as a list of its actors, SATM first determines the exact actor invocations found in the trace. It then discovers the actors’ periods, as well as the parts of the trace in which the periods are not respected. Those parts are further analyzed to extract patterns of system activity that differentiate them from the rest of the trace. Such patterns can give strong hints on the origin of the problem and are returned to the developer. More specifically, we represent those patterns as minimal contrast sequences and investigate various solutions to mine such sequences from execution trace data.
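The period-checking step can be illustrated with a minimal sketch (SATM's actual period discovery is more robust): estimate an actor's nominal period as the median gap between consecutive invocations, then flag gaps that deviate from it beyond a tolerance. The timestamps below are hypothetical.

```python
from statistics import median

def period_violations(timestamps, tolerance=0.2):
    """Indices of inter-invocation gaps that deviate from the actor's
    nominal period by more than `tolerance` (relative). The nominal
    period is estimated as the median gap."""
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    period = median(gaps)  # robust to a few outlier gaps
    return [i for i, gap in enumerate(gaps)
            if abs(gap - period) > tolerance * period]
```

For a 30-frames-per-second display actor, invocation timestamps in milliseconds should show gaps of about 33 ms; a gap of 81 ms would be flagged, and the corresponding slice of the trace is what gets contrasted against the well-behaved parts.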
Finally, we demonstrate SATM’s ability to detect both an artificial perturbation injected into an open-source multimedia framework and temporal bugs from two industrial use cases coming from STMicroelectronics. We also provide an extensive analysis of sequential pattern mining algorithms applied to execution trace data and explain why state-of-the-art algorithms fail to efficiently mine sequential patterns from real-world traces.
– Jean-François MEHAUT, Professor, Grenoble Alpes University, Director
– Miguel SANTANA, STMicroelectronics, Co-director
– Alexandre TERMIER, Professor, Rennes 1 University, Co-Advisor
– Vincent LEROY, Associate Professor, Grenoble Alpes University, Co-Advisor
– Sebastian FISCHMEISTER, Professor, University of Waterloo, Reviewer
– Maguelonne TEISSEIRE, Professor, IRSTEA, Reviewer
– Marc PLANTEVIT, Professor, Lyon 1 University, Examiner
– Raymond NAMYST, Professor, Bordeaux University, Examiner
Defense: November 6, 2015
Funding: French Ministry of Higher Education and Research (MESR)
Directors: Sihem Amer-Yahia
co-supervisor: Alexandre Termier (IRISA, Univ. Rennes 1)
User data is becoming increasingly available in multiple domains, ranging from phone usage traces to data on the social Web. User data is a special type of data described by user demographics (e.g., age, gender, occupation) and user activities (e.g., rating, voting, watching a movie). The analysis of user data is appealing to scientists who work on population studies, online marketing, recommendations, and large-scale data analytics. However, analysis tools for user data are still lacking.
In this thesis, we argue that there exists a unique opportunity to analyze user data in the form of user groups. This contrasts both with individual-user analysis and with statistical analysis of the whole population. A group is defined as a set of users whose members have either common demographics or common activities. Group-level analysis reduces the amount of sparsity and noise in the data and leads to new insights. We propose a user group management framework consisting of the following components: user group discovery, analysis, and recommendation.
The very first step in our framework is group discovery: given raw user data, obtain user groups by optimizing one or more quality dimensions. The second component, analysis, is necessary to tackle the problem of information overload: the output of a user group discovery step often contains millions of user groups, and it is tedious for an analyst to skim over all of them. We thus need analysis tools that provide valuable insights into this huge space of user groups. The final question in the framework is how to use the discovered groups. In this thesis, we investigate one such application, user group recommendation, by considering affinities between group members.
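Demographics-based group discovery can be sketched as a plain enumeration of attribute combinations shared by enough users (the thesis additionally optimizes quality dimensions such as coverage and diversity, and handles activity-based groups). The user records below are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def discover_groups(users, min_size=2):
    """Enumerate groups of users sharing a combination of demographic
    attributes. A group is keyed by the sorted (attribute, value) pairs
    its members have in common; only groups with at least `min_size`
    members are kept."""
    groups = defaultdict(set)
    for uid, demographics in users.items():
        attrs = sorted(demographics.items())
        for r in range(1, len(attrs) + 1):
            for combo in combinations(attrs, r):
                groups[combo].add(uid)
    return {k: v for k, v in groups.items() if len(v) >= min_size}
```

Even this toy enumeration shows why the analysis component matters: with real demographics the number of groups explodes combinatorially, far beyond what an analyst can inspect by hand.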
All contributions of the proposed framework are evaluated using an extensive set of experiments, assessing both quality and performance.
– M. Divesh Srivastava – Research Director, AT&T Labs, Reviewer
– M. Bruno Crémilleux – Professor, Caen Normandie University, Reviewer
– Mme. Ahlame Douzal – Associate Professor, Grenoble Alpes University, Examiner
– M. David Gross Amblard – Professor, Rennes 1 University, Examiner
– Mme. Elisa Fromont – Associate Professor, Jean Monnet University, Examiner
– Mme. Sihem Amer-Yahia, Research Director, CNRS, Director
– M. Alexandre Termier, Professor, Rennes 1 University, Co-advisor
Léon Fopa
Defense: June 23, 2015
Funding: FUI SoCTrace (Jan 12 – Jan 15)
Directors: Jean-François Mehaut (LIG, Nanosim team)
co-supervisor: Fabrice Jouanot and Alexandre Termier (IRISA, Univ. Rennes 1)
Application analysis and debugging are increasingly challenging tasks in modern systems, especially in systems based on embedded multiprocessor components (MPSoCs), which make up the majority of our daily devices today. The use of execution traces is unavoidable for a detailed analysis of such systems and for the identification of unexpected behaviors. Even though a trace offers the developer a rich corpus of information, the information relevant to the analysis is hidden in the trace and unusable without a high level of expertise. Tools dedicated to trace analysis thus become necessary. However, existing tools take little or no account of the business aspects specific to an application, or of the developer’s business knowledge, to optimize the analysis task.
In this thesis, we propose an approach that allows the developer to represent, manipulate, and query an execution trace based on concepts related to her own business knowledge. Our approach relies on an ontology to model and query business concepts in a trace, and on an inference engine to reason about these business concepts. Specifically, we propose VIDECOM, a domain ontology for the analysis of execution traces of multimedia applications embedded on MPSoCs. We then focus on scaling the use of this ontology to the analysis of huge traces. To this end, we conduct a comparative study of different ontology management systems (triplestores) to determine the architecture most appropriate for handling very large traces with our VIDECOM ontology. We also propose an inference engine that addresses the challenges of reasoning about business concepts, namely the inference of the temporal order between business concepts in the trace, and the termination of the process of generating new knowledge from business knowledge. Finally, we illustrate the practical use of VIDECOM in the SoCTrace project for the analysis of real execution traces on MPSoCs.
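At its core, querying business concepts over a trace modelled as an ontology amounts to matching triple patterns against a set of subject-predicate-object facts. A minimal sketch follows (real triplestores and the VIDECOM reasoner are of course far richer; the trace facts and concept names below are invented for illustration):

```python
def match(triples, pattern):
    """Return all triples matching a (subject, predicate, object)
    pattern, where None acts as a wildcard."""
    return [t for t in triples
            if all(p is None or p == v for p, v in zip(pattern, t))]

# Hypothetical business-level facts lifted from a raw execution trace.
trace_facts = [
    ("frame1", "type", "VideoFrame"),
    ("frame1", "decodedBy", "DecoderActor"),
    ("frame2", "type", "AudioFrame"),
]
video_frames = match(trace_facts, (None, "type", "VideoFrame"))
```

The inference engine's job is then to derive new triples (e.g. the temporal order between such business concepts) from existing ones, while guaranteeing that this derivation process terminates.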
– M. François Goasdoué, Professor, University of Rennes, Reviewer
– M. Pierre Boulet, Professor, University of Lille, Reviewer
– Mme. Nadine Cullot, Professor, University of Bourgogne, Examiner
– Mme. Marie-Odile Cordier, Professor, University of Rennes, Examiner
– M. Jean-François Méhaut, Professor, University of Grenoble, Director
– M. Maurice Tchuenté, Professor, University of Yaoundé 1, Co-Director
– M. Fabrice Jouanot, Associate Professor, University of Grenoble, Advisor
– M. Alexandre Termier, Professor, University of Rennes, Co-Advisor
Defense: December 15, 2014
Funding: French Ministry of Higher Education and Research (MESR)
Directors: Marie-Christine Rousset
co-supervisor: Manuel Atencia (LIG/Inria, EXMO team)
In this thesis we investigate several approaches that help users find useful and trustworthy information in the Web of Data using Semantic Web technologies. For this purpose, we tackle two research issues: data linkage in Linked Data, and trust in Semantic P2P Networks. We model the problem of data linkage in Linked Data as a reasoning problem on possibly decentralized data. We describe a novel Import-by-Query algorithm that alternates steps of subquery rewriting and of tailored querying of the Linked Data cloud, in order to import data as specific as possible for inferring or contradicting given target same-as facts. Experiments conducted on real-world datasets have demonstrated the feasibility of this approach and its usefulness in practice for data linkage and disambiguation. Furthermore, we propose an adaptation of this approach that takes into account possibly uncertain data and knowledge, resulting in the inference of same-as and different-from links associated with probabilistic weights. In this adaptation we model uncertainty as probability values. Our experiments have shown that the adapted approach scales to large datasets and produces meaningful probabilistic weights.
Concerning trust, we introduce a trust mechanism for guiding the query-answering process in Semantic P2P Networks. Peers in Semantic P2P Networks organize their information using separate ontologies and rely on alignments between their ontologies to translate queries. Trust in such a setting is subjective, and estimates the probability that a peer will provide satisfactory answers for specific queries in future interactions. In order to compute trust, the mechanism exploits the information provided by alignments, along with the information that comes from the peers’ own experiences. The computed trust values are refined over time using Bayesian inference, as more queries are sent and answers received. To evaluate our mechanism, we have built a semantic P2P bookmarking system (TrustMe) in which we can vary different quantitative and qualitative parameters. The experimental results show the convergence of trust, and highlight the gain in the quality of peers’ answers, measured with precision and recall, when the process of query answering is guided by our trust mechanism.
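The Bayesian refinement of trust values can be sketched with a standard Beta-Bernoulli update, a common scheme for this kind of estimate (TrustMe's actual model also exploits alignment information and maintains trust per class of queries):

```python
class BetaTrust:
    """Trust as the mean of a Beta distribution over the probability
    that a peer returns a satisfactory answer."""
    def __init__(self, alpha=1.0, beta=1.0):
        self.alpha, self.beta = alpha, beta  # Beta(1, 1) = uniform prior

    def update(self, satisfactory):
        """Bayesian update after observing one interaction."""
        if satisfactory:
            self.alpha += 1
        else:
            self.beta += 1

    @property
    def value(self):
        """Posterior mean: current trust estimate in [0, 1]."""
        return self.alpha / (self.alpha + self.beta)

trust = BetaTrust()            # no history yet: trust = 0.5
for outcome in (True, True, True, False):
    trust.update(outcome)      # three satisfactory answers, one not
```

As interactions accumulate, the estimate converges to the peer's true answer quality, which mirrors the convergence of trust observed in the TrustMe experiments.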
– M. Andrea Tettamanzi – Professor, Université Nice Sophia Antipolis, Reviewer
– M. Mohand-Said Hacid – Professor, Université Claude Bernard Lyon 1, Reviewer
– Mme. Marie-Laure Mugnier – Professor, Université de Montpellier 2, Examiner
– M. Jérôme Euzenat – Research Director, INRIA Rhône-Alpes, Examiner
– Mme. Marie-Christine ROUSSET – Professor, Université Joseph Fourier (Grenoble 1), Director
– M. Manuel ATENCIA – Associate Professor, Université Pierre Mendès-France (Grenoble 2), Co-advisor
Defense: December 5, 2014
Funding: FUI SoCTrace (Jan 12 – Jan 15)
Directors: Marie-Christine Rousset, Maurice Tchuente (Univ. Yaoundé 1)
co-supervisor: Noha Ibrahim
Nowadays, due to the increasing complexity of both applications and the underlying hardware, it is difficult to understand what happens during the execution of these applications. Tracing techniques are commonly used to gather and provide information on application execution in the form of execution traces. Execution traces, which are sequences of events, can be very large (easily millions of events) and hard to understand, and thus require specific analysis tools. One critical case is the analysis of applications on embedded systems such as set-top boxes or smartphones, especially for understanding bugs in multimedia applications. In this thesis we propose two novel analysis techniques adapted to multimedia applications on embedded systems. The first reduces the size of the trace given to the analyst, which requires grouping sets of related events together. We propose an approach based on optimization and pattern mining techniques to automatically extract a set of subsequences from an execution trace. Our experiments showed that the method scales to large amounts of data, and highlighted the practical interest of this approach. Our second contribution is a diagnosis method based on the comparison of execution traces with reference traces. This approach is implemented in TED, our TracE Diagnosis tool. Experiments conducted on real-life use cases of multimedia application execution traces have validated that TED is scalable and brings added value to trace analysis. We also show that the tool can be applied to reduced-size traces in order to further improve scalability.
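One of the simplest forms such a trace-versus-reference diagnosis can take is a comparison of event frequencies (TED's method is more elaborate; the event names and threshold below are invented for illustration):

```python
from collections import Counter

def diagnose(trace, reference, threshold=2.0):
    """Report event types whose frequency in the analyzed trace deviates
    from the reference trace by at least a factor of `threshold`.
    Add-one smoothing handles events absent from one of the traces."""
    observed, expected = Counter(trace), Counter(reference)
    anomalies = {}
    for event in set(observed) | set(expected):
        ratio = (observed[event] + 1) / (expected[event] + 1)
        if ratio >= threshold or ratio <= 1 / threshold:
            anomalies[event] = ratio
    return anomalies

# A hypothetical faulty run shows far more interrupts than the reference.
faulty    = ["irq"] * 9 + ["decode_frame"] * 5
reference = ["irq"] * 1 + ["decode_frame"] * 5
suspects = diagnose(faulty, reference)
```

Events flagged this way point the analyst at the system activity that differs from a known-good execution, which is the starting point of the diagnosis.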
– M. Laks V.S. LAKSHMANAN, Professor, University of British Columbia, Reviewer,
– M. Pascal PONCELET, Professor, University of Montpellier 2, Reviewer,
– M. Alexandre TERMIER, Professor, University of Rennes 1, Examiner,
– Mme Céline ROBARDET, Professor, INSA Lyon, University of Lyon, Examiner,
– M. Éric GAUSSIER, Professor, University of Grenoble-Alpes, Examiner,
– Mme Marie-Christine ROUSSET, Professor, University of Grenoble-Alpes, Director,
– M. Maurice TCHUENTE, Professor, University of Yaoundé 1, Co-advisor,
– Mme Noha IBRAHIM, Associate Professor, University of Grenoble, Co-advisor.