SLIDE Research Topics
Our research is applied to a variety of datasets ranging from personal data to execution traces and data from the Semantic Web. Many of our applications are evaluated using traditional information retrieval and machine learning techniques, as well as crowdsourcing.
Data acquisition and enrichment
The goal of social applications such as website recommendation on del.icio.us, sentiment extraction from Twitter, movie recommendation on MovieLens, and itinerary extraction from Flickr is to analyze Big Social Data (BSD) and process it in order to understand it and transform it into content valuable to users. Building social applications requires an essential data preparation step during which raw BSD is sanitized, normalized, enriched, pruned, and transformed, making it readily available for further processing. We argue for the need to formalize data preparation and to develop appropriate tools that enable easy prototyping of social applications. We will develop a framework composed of an architecture, an algebra, and a language for data preparation in social applications. We examined a large number of efforts in building two families of social applications that manipulate BSD, content recommendation and analytics, and based on this work we proposed SOCLE, a framework for data preparation in social applications. The development of SOCLE will begin by implementing the necessary modules, namely pruning, normalization, user similarity functions, network construction functions, and cluster and index generation, and by tackling several challenges such as the storage model, the expressivity of the adopted algebra and language, and the invertibility of the data preparation operators.
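To make the idea of composable preparation operators concrete, here is a minimal sketch in Python. The operator names and signatures (`prune`, `normalize`) are illustrative assumptions for this example, not the actual SOCLE API.

```python
# Hypothetical data preparation operators in the spirit of SOCLE;
# names and signatures are assumptions made for illustration.

def prune(records, predicate):
    """Drop raw records that fail a quality predicate."""
    return [r for r in records if predicate(r)]

def normalize(records, field, fn):
    """Apply a normalization function to one field of every record."""
    return [{**r, field: fn(r[field])} for r in records]

# Example: preparing raw tagging data (del.icio.us-style bookmarks).
raw = [
    {"user": "u1", "tag": " Python "},
    {"user": "u2", "tag": "python"},
    {"user": "u3", "tag": ""},
]
prepared = normalize(prune(raw, lambda r: r["tag"].strip() != ""),
                     "tag", lambda t: t.strip().lower())
# prepared now holds two records, both with tag "python"
```

Because each operator takes and returns a plain collection of records, pipelines compose freely, which is the property an algebra of preparation operators formalizes.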
Web data linkage
The adoption of the Linked Data principles has led to a Web of data comprising several billion RDF triples, interlinked by several million RDF links and connecting thousands of data sources from diverse domains such as people, companies, books, scientific publications, films, music, genes, proteins, drugs, scientific data, and reviews. RDF links between datasets take the form of RDF triples whose subject is a URI reference in the namespace of one dataset, while the predicate and/or object are URI references pointing into the namespaces of other datasets. A particular case is that of same-as facts, which express that two URIs refer to the same real-world object. We investigate a logic-based, declarative approach to infer same-as facts from the schema constraints known on the domain. Most existing approaches follow the main trend in entity resolution: they define similarity metrics to compare entities and apply aggregation functions to combine the similarity scores obtained for specific properties. In contrast, our approach is based on reasoning over the data using inference rules that capture in a uniform way several schema constraints, as well as same-as transitivity and, possibly, useful domain-specific knowledge. In order to handle local data incompleteness, we have designed a rule-based import-by-query approach that imports from external sources only the precise data sufficient for inferring the target same-as facts. This work is done in the setting of the Qualinca project and experimented on real data from INA (Institut National de l’Audiovisuel).
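The flavor of this rule-based inference can be sketched as a small saturation procedure. The rules below (symmetry, transitivity, and an inverse-functional key rule over a hypothetical identifying property) are illustrative assumptions; the actual rule set is derived from the domain's schema constraints.

```python
# Hedged sketch: inferring same-as facts by rule-based saturation.
# The "keys" argument stands for a hypothetical inverse-functional
# property (e.g., an ISBN): entities sharing a key denote one object.

def saturate_same_as(same_as, keys):
    """same_as: set of (a, b) pairs; keys: dict entity -> key value.
    Returns the set of same-as facts saturated under the rules."""
    facts = set(same_as)
    # Key rule: two entities sharing a functional key are the same.
    by_key = {}
    for entity, key in keys.items():
        by_key.setdefault(key, []).append(entity)
    for group in by_key.values():
        for a in group:
            for b in group:
                if a != b:
                    facts.add((a, b))
    # Saturate under symmetry and transitivity until fixpoint.
    changed = True
    while changed:
        changed = False
        for (a, b) in list(facts):
            if (b, a) not in facts:
                facts.add((b, a)); changed = True
        for (a, b) in list(facts):
            for (c, d) in list(facts):
                if b == c and a != d and (a, d) not in facts:
                    facts.add((a, d)); changed = True
    return facts

facts = saturate_same_as({("x", "y")}, {"y": "k1", "z": "k1", "x": "k2"})
# x = y is asserted, y = z follows from the shared key, so x = z is inferred
```

The import-by-query step would, in this picture, fetch from a remote source just the key values needed to fire the key rule for the target entities.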
Crowd data sourcing
We investigate optimization opportunities in “collaborative crowdsourcing”. In contrast to traditional crowdsourcing, such as image tagging or categorization, where a number of people work on simple micro-tasks without any collaboration, we consider collaborative tasks (example applications include citizen science, MOOCs, and participatory sensing), where workers form groups to work on complex, knowledge-intensive tasks and synergistically achieve a goal. Our efforts in this space can be broadly classified as follows:
a) how to formalize an application-agnostic optimization model for collaborative crowdsourcing.
b) how to perform effective worker-to-task assignment.
c) how to estimate quality for completed tasks.
d) how to learn human factors.
In particular, both a) and d) investigate the socio-psychological aspects that motivate individuals to contribute to a group-based effort, such as group size, group dynamics, and worker affinity, and their potential impact on individual or group performance.
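Item b), worker-to-task assignment, can be illustrated with a toy greedy procedure. This is not the algorithm from our work, only a sketch of the problem shape: workers with skill sets, tasks with skill requirements, and a group-size constraint.

```python
# Illustrative sketch (not our actual algorithm): greedy assignment
# maximizing skill overlap, subject to a per-task group-size limit.

def assign(workers, tasks, group_size):
    """workers: dict name -> set of skills;
    tasks: dict name -> set of required skills."""
    assignment = {t: [] for t in tasks}
    # Score every (worker, task) pair by skill overlap.
    pairs = sorted(
        ((len(workers[w] & tasks[t]), w, t)
         for w in workers for t in tasks),
        reverse=True)
    assigned = set()
    for score, w, t in pairs:
        # Only assign a worker once, only to tasks they can help with.
        if score > 0 and w not in assigned and len(assignment[t]) < group_size:
            assignment[t].append(w)
            assigned.add(w)
    return assignment

workers = {"ann": {"stats", "bio"}, "bob": {"stats"}, "eve": {"nlp"}}
tasks = {"citizen-science": {"bio", "stats"}, "tagging": {"nlp"}}
result = assign(workers, tasks, group_size=2)
```

A real formulation would also fold in the human factors from d), e.g. weighting pairs by worker affinity rather than skill overlap alone.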
Large–scale data analytics
Our data analytics axis (D for Discovery in SLIDE) covers advanced pattern mining, including generic mining algorithms, mining on parallel infrastructures such as MapReduce and many-core processors, and social media analytics. Our models and algorithms combine data mining with multi-dimensional indexing to discover a variety of information of interest in raw data. Our expert-facing applications enable new data exploration approaches such as interactive mining. In this same axis, we are developing a data preparation framework (algebra and algorithms) for sanitizing and transforming large data volumes into ready-to-exploit data. We are also designing a crowd data sourcing framework that optimizes data acquisition from users online. Finally, we develop extensions to Datalog to express and infer linkage between heterogeneous data sources.
Advanced pattern mining
We focus on efficient generic algorithms. Unlike most research in pattern mining, where one algorithm can only extract one given type of pattern, we propose algorithms that can efficiently extract many different types of patterns. Our approach to genericity is grounded in a set-theoretic characterization of the structure of the set of patterns to compute, and exploits this structure both for pattern enumeration and for efficient data access.
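A toy instance of structure-driven enumeration: frequent itemsets form a lower set of the subset lattice (support is anti-monotone), so a miner can enumerate level by level and prune supersets of infrequent sets. Generic miners exploit this same kind of set-theoretic structure for other pattern types; the code below is only the classic itemset case.

```python
# Level-wise frequent itemset enumeration exploiting anti-monotone
# support on the subset lattice (an Apriori-style sketch).
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    items = sorted({i for t in transactions for i in t})
    def support(s):
        return sum(1 for t in transactions if s <= t)
    frequent, level = [], [frozenset([i]) for i in items]
    while level:
        level = [s for s in level if support(s) >= min_support]
        frequent += level
        # Candidate generation: join frequent sets differing by one item.
        level = list({a | b for a, b in combinations(level, 2)
                      if len(a | b) == len(a) + 1})
    return frequent

ts = [frozenset("abc"), frozenset("ab"), frozenset("ac")]
fs = frequent_itemsets(ts, min_support=2)
# {a}, {b}, {c}, {a,b}, {a,c} are frequent; {b,c} is pruned
```

The same enumeration skeleton works whenever the pattern space has this lattice structure, which is precisely what a generic characterization makes explicit.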
We develop parallel pattern mining algorithms extended to distributed architectures (MapReduce, Twitter Storm) to scale beyond single-server performance. They are applied to (i) the social Web, such as collaborative music rating sites, in order to identify the items most commonly listened to together, and (ii) merchant sites in digital marketing, in order to quantify user engagement and conversion rates on those sites.
We also want our algorithms to have a manageable output. Classical pattern mining algorithms often output millions of patterns, overwhelming the analyst. Our research focuses on principled ways to output a few meaningful patterns to the analyst, with application to the analysis of execution traces thanks to a strong partnership with STMicroelectronics. The principles explored are: (i) exploiting scoring functions and combinatorial optimization techniques, (ii) exploiting techniques developed in Web search, such as top-k processing, and (iii) exploiting domain knowledge expressed through ontologies to guide the extraction of meaningful patterns.
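Principle (ii) can be sketched in a few lines: keep only the k best patterns under a scoring function rather than returning them all. The score used below (pattern length times support) is an assumption for the example, not one of our published measures.

```python
# Minimal top-k pattern selection sketch; the scoring function is a
# made-up example favoring long, frequent patterns.
import heapq

def top_k_patterns(patterns, k, score):
    """patterns: iterable of (itemset, support) pairs."""
    return heapq.nlargest(k, patterns, key=score)

mined = [({"a"}, 10), ({"a", "b"}, 6), ({"a", "b", "c"}, 5), ({"d"}, 2)]
best = top_k_patterns(mined, k=2, score=lambda p: len(p[0]) * p[1])
# best keeps only the two highest-scoring patterns out of millions
```

In practice the gain comes from pushing the score into the mining itself (pruning with score bounds), not from post-filtering as done here; this sketch shows only the output contract.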
Distributed data processing algorithms
We are developing new data processing algorithms to leverage multi-scale parallel infrastructures, ranging from many-core machines to large Hadoop clusters. Processing on such systems involves manipulating large amounts of data over distributed architectures. In this context, remote data accesses constitute a major bottleneck. Hence, it is important to optimize data placement so that all the pieces of data likely to be required simultaneously by a given process are co-located, enabling in-place processing.
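The basic mechanism behind such co-location is hash co-partitioning: records from different datasets that share a key are routed to the same node, so any operation over that key runs locally. A minimal sketch, with made-up record shapes:

```python
# Hash co-partitioning sketch: two datasets partitioned by the same
# key land matching records on the same node.

def co_partition(dataset, key, num_nodes):
    """Route each record to a node by hashing its key field."""
    partitions = [[] for _ in range(num_nodes)]
    for record in dataset:
        partitions[hash(record[key]) % num_nodes].append(record)
    return partitions

users = [{"uid": 1, "name": "ann"}, {"uid": 2, "name": "bob"}]
ratings = [{"uid": 1, "movie": "m1"}, {"uid": 2, "movie": "m2"}]
u_parts = co_partition(users, "uid", 4)
r_parts = co_partition(ratings, "uid", 4)
# Every rating now sits on the same node as its user record.
```

Real placement strategies must go beyond a single shared key (e.g., for graph or mining workloads where co-access patterns are not captured by one attribute), which is what motivates the work above.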
Following previous work focusing on social applications, we will develop new data placement strategies adapted to more general data mining problems and extend them to stream processing platforms. Our efforts in this space focus on: (i) join operations and (ii) pattern mining.
(i) General join operations are notoriously costly to process efficiently on clusters of machines. Our goal in this domain is to consider particular classes of join operations that encapsulate common, useful data mining operations but are restrictive enough to allow the design of precise and optimized algorithms. In particular, we consider the case of hierarchical data as well as time intervals. We explore these algorithms in the context of batch processing (Hadoop) as well as stream processing (e.g., Twitter Storm).
(ii) We are developing the next generation of pattern mining algorithms, which exploit the parallel processing power of multi/many-core processors (e.g., the 256-core Kalray MPPA) or GPUs in order to reduce running time. Our approaches exploit different levels of parallelism that match the hierarchical nature of these architectures, and optimize memory accesses in a NUMA (Non-Uniform Memory Access) context.
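As an example of why specializing the join class pays off, consider the time-interval case from (i): a sweep over sorted endpoints reports all overlapping pairs in near-linear time plus output, instead of the quadratic all-pairs comparison a general-purpose join would do. This single-machine sketch shows the core idea; the distributed versions additionally partition the timeline across nodes.

```python
# Plane-sweep sketch of an interval overlap join (single machine).

def interval_join(left, right):
    """left, right: lists of (id, start, end); returns overlapping pairs."""
    events = [(s, e, ("L", i)) for i, s, e in left] + \
             [(s, e, ("R", i)) for i, s, e in right]
    events.sort()  # sweep intervals in order of start time
    active_l, active_r, out = [], [], []
    for s, e, (side, i) in events:
        # Retire intervals that ended before this one starts.
        active_l = [(ae, ai) for ae, ai in active_l if ae >= s]
        active_r = [(ae, ai) for ae, ai in active_r if ae >= s]
        if side == "L":
            out += [(i, rj) for _, rj in active_r]
            active_l.append((e, i))
        else:
            out += [(li, i) for _, li in active_l]
            active_r.append((e, i))
    return out

pairs = interval_join([("a", 1, 5), ("b", 6, 9)], [("x", 4, 7)])
# "x" overlaps both "a" and "b"
```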
Social media and health analytics
Recent studies show a great impact of lifestyle on health. Several diseases, such as diabetes, those related to hypertension, cardiovascular diseases, and some types of cancer, can be prevented with the right nutrition and healthy exercise habits.
The data we want to gather concern individuals and their demographics (e.g., age, gender, geographic location), their lifestyle (e.g., nutrition and exercise habits), and their health (e.g., minor or major ailments and diseases). Fortunately, an increasing number of mobile health applications for tracking nutritional intake or overall physical activity and calorie burn are becoming available. As a result, valuable data reflecting the impact of our lifestyle on our health is being produced. At the same time, people are increasingly active on social networks such as Twitter and Facebook, where they express numerous details of their lives, such as their health issues or the special lifestyle they are currently following. The collected data will be fed into a collection of analytics primitives that can be combined to discover and verify correlations. These primitives can be classified into three types: (i) text processing, (ii) pattern mining, and (iii) hypothesis testing.
(i) Text processing primitives will build on existing work on statistical topic discovery and extend it to take spatio-temporal factors into account. We will focus on tweets located in France and Europe. Geo-tags are important, as they reflect the environment we live in (e.g., big cities versus countryside). We will also use the timestamp associated with each tweet. The primitives should also use external data sources to identify the concepts of interest (related to demographics, health, and lifestyle) to extract from Twitter.
(ii) Pattern mining primitives will extend existing mining algorithms, in our case LCM and FrameMiner (our generic pattern mining algorithm), to find correlations over time between lifestyle (e.g., fats, carbs, proteins, fruits, vegetables, meat, running, swimming), health (e.g., diabetes, cardiovascular disease, cancer, headache, stomach pain, stress level), and demographics (age, income, address, etc.). In this case, a pattern is a combination of two or three concepts related to demographics, health, and lifestyle.
(iii) Hypothesis testing primitives will validate the correlations found in (ii) using crowdsourcing, in our case the Nutrinet study (a French crowdsourcing success from UREN, the Nutritional Epidemiology Research Unit, with 125,000 volunteers). In the case of Nutrinet, the collected data is structured and controlled (predefined questionnaires sent on predefined days), unlike Twitter, where users spontaneously express their daily habits in an unstructured manner.
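A basic hypothesis-testing primitive of type (iii) can be sketched with a chi-square test of independence on a 2x2 contingency table, e.g. exercise habit against a reported condition. The counts below are invented for the example; real counts would come from the mined tweets and the Nutrinet questionnaires.

```python
# Chi-square test of independence on a 2x2 table (stdlib only),
# as a sketch of a hypothesis-testing primitive. Counts are made up.

def chi_square_2x2(a, b, c, d):
    """Table [[a, b], [c, d]]; returns the chi-square statistic."""
    n = a + b + c + d
    # Shortcut formula for 2x2 tables.
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

#                 hypertension   no hypertension
# runs                 30              170
# does not run         90              210
stat = chi_square_2x2(30, 170, 90, 210)
significant = stat > 3.841  # critical value for df=1, alpha=0.05
```

A correlation surfaced by the pattern mining primitive would be kept only if such a test rejects independence on the controlled Nutrinet data.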
Our information exploitation axis (E for Exploitation in SLIDE) covers the development of distributed join algorithms. We combine data partitioning and data placement techniques with traditional join algorithms to design faster data processing on distributed parallel infrastructures. We also develop ontology-based data access algorithms that allow analysts to explore large data volumes through high-level concepts. Our user-facing applications rely on novel search and recommendation algorithms ranging from searching for relevant and diverse results, to defining and implementing novel semantics for recommendation that include social networks and different user similarity functions.
Ontology-based data access
Ontology-based data access (OBDA) is a new paradigm in data management that seeks to exploit the semantic knowledge expressed in ontologies when querying data. Ontologies help palliate data incompleteness by allowing new facts to be inferred from the ontology and the data, which can result in additional answers to user queries. We are investigating a rule-based OBDA approach that consists in enriching RDF triple stores with Datalog-like rules to express ontological statements beyond those expressible in RDFS. So far, most work on OBDA has been done in the setting of Description Logics, in which an ontology is a set of general concept inclusion statements and constraints expressed in a TBox, while the data is a set of facts composing the ABox. Such an approach does not fit the Web of data well, where data and schema statements are mixed within RDF triple stores, allowing high-level queries over both the data and the schema using the SPARQL query language. We are developing a generic deductive triple store framework that we use to compare saturation-based and query-rewriting approaches to OBDA in the setting of real-world applications. This rule-based OBDA approach has been applied to the domain of anatomy and 3D models by transforming the My Corporis Fabrica (MyCF) software into an ontology-based tool for reasoning and querying over complex anatomical models. This work is done in the setting of the PAGODA project and the LBA “équipe-action” of the labex PERSYVAL-lab, in collaboration with Olivier Palombi, a professor of anatomy and a clinical practitioner (neurosurgeon) who belongs both to the Computer Graphics team of the LJK laboratory and to the Anatomy Laboratory of the University Hospital of Grenoble (CHU).
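The saturation side of this comparison can be sketched as forward-chaining Datalog-like rules over a triple store until fixpoint, then answering queries against the enlarged fact set. The two rules below mimic RDFS subclass and domain entailment on a made-up anatomical example; the rule language in our framework is richer.

```python
# Saturation sketch for rule-based OBDA: forward-chain RDFS-like
# rules over a set of triples until no new fact is derived.

def saturate(triples, subclass, domain):
    """subclass: dict class -> superclass; domain: dict property -> class."""
    facts = set(triples)
    changed = True
    while changed:
        changed = False
        new = set()
        for s, p, o in facts:
            if p == "type" and o in subclass:   # rdfs:subClassOf rule
                new.add((s, "type", subclass[o]))
            if p in domain:                      # rdfs:domain rule
                new.add((s, "type", domain[p]))
        if not new <= facts:
            facts |= new
            changed = True
    return facts

facts = saturate(
    {("femur", "partOf", "leg")},
    subclass={"Bone": "AnatomicalEntity"},
    domain={"partOf": "Bone"})
# ("femur", "type", "Bone") and ("femur", "type", "AnatomicalEntity")
# are inferred, so a query for AnatomicalEntity instances finds femur.
```

Query rewriting is the dual strategy: instead of enlarging the data, the query for `AnatomicalEntity` would be rewritten into a union covering `Bone` and `partOf` subjects, leaving the store untouched.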
Interactive information exploration
We are developing a model and algorithms to help analysts explore the large number of patterns mined from user datasets. The datasets range from collaborative rating sites such as MovieLens to personal user data. We have developed approximation algorithms that enable analysts to quickly identify groups of users of interest. In this area, we pay special attention to the question of validation: we are designing a principled approach to validating the interactive mining of large user datasets.