Master's Internship: Debugging Concurrent Programs using Model Checking and Mining Techniques
Master's Internship: Data-driven building of learning models based on a declarative query-answering approach: application to yield prediction in industry.
Master's Internship: Reconciling data privacy and data sharing in digital health records.
Master's Internship: Improving filtering and matching between profiles and offers by combining RDF data, ontologies and textual content.
Master's Internship: Combining RDF data, ontologies and textual content for improving information retrieval in a medical training semantic wiki.
Master's Internship: Combining exogenous and endogenous data to better anticipate trauma care activity at the Grenoble University Hospital (CHU de Grenoble)
Master's Internship: Contextual adaptation of medical forms to the patient's pathologies in neurosurgery
Research engineer positions: Nano 2017 ESPRIT
Stream processing & pattern mining with hardware support
To apply, contact Vincent Leroy
Data production grows continuously. The development of the Internet of Things and of sensors produces terabits of activity traces. Pattern mining algorithms are a cornerstone of data mining. They detect frequently occurring patterns (sets, sequences, sub-graphs) in data. These patterns can later be used in many applications, including classification, recommendation and prediction. Existing approaches to pattern mining focus on batch processing, that is, offline computation. However, more recent work considers stream (online) processing. Stream processing has the advantage of reading data only once, which limits complexity at the cost of approximate results. It also allows continuous analysis, so results can be obtained with low latency to detect anomalies in real time.
The goal of the Nano2017 ESPRIT project is to propose a hardware solution for pattern mining in high-throughput data streams. This solution, which could be offered as a hardware support card, will be able to test simultaneously for the presence of a large number of patterns in the data. The benefits of such a solution are (i) to process faster streams than purely software approaches can, and (ii) to use fewer servers to process data streams, thus reducing energy consumption.
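To make the trade-off between a single pass over the data and approximate results concrete, here is a minimal software sketch of a classic one-pass technique, the Misra-Gries summary for frequent items. It is only an illustration and not the method of the Nano2017 ESPRIT project; the function name, the toy stream and the parameter `k` are assumptions chosen for the example.

```python
def misra_gries(stream, k):
    """One-pass approximate frequent-item summary using at most k - 1 counters.

    Any item occurring more than n / k times in a stream of length n is
    guaranteed to survive in the summary; its stored count may be too low.
    """
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement all counters and drop the empty ones.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


# Hypothetical toy stream: "a" occurs 5 times out of 9 events.
stream = ["a", "b", "a", "c", "a", "d", "a", "b", "a"]
summary = misra_gries(stream, k=3)
print(summary)
```

The summary keeps a bounded amount of state whatever the stream length, which is the property that makes this family of algorithms attractive for high-throughput streams.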
DESIRED SKILLS AND EXPERIENCE
- A strong desire to implement systems that use the latest scientific results
- A good command of English
- Ability to work as part of a team
- Sufficient educational background to understand the science and mathematics involved in machine learning/data mining algorithms
- Experience working in Linux/Unix environment
- Experience in C/C++ or Java
- Experience with at least one of the following: Python, Torch/Lua, Matlab
- Practical experience with machine learning or deep learning is a plus
Internship: Ontology-based Query Answering: comparison between a Query Rewriting approach and a Materialization approach in the SIDES 3.0 setting.
SIDES 3.0 is an ontology-based e-learning platform for training and monitoring students in Medicine. It is under construction as a national project funded by ANR for 3 years.
The SIDES 3.0 knowledge base is composed of a lightweight domain ontology made of RDFS statements and some domain-specific rules, together with a huge amount of RDF facts on instances of the classes and properties in the ontology.
Query answering through ontologies is more complex than for classical databases because answers to queries may have to be inferred from the data and the ontological statements, thus requiring reasoning techniques to be employed.
Two approaches have been designed for ontology-based query answering:
Query rewriting consists in exploiting the ontological statements to reformulate the input query into a union of queries, which can then be evaluated directly against the data to obtain the set of answers to the input query.
Materialization consists in exploiting the ontological statements as rules in order to saturate the data by inferring all the possible facts that can be derived from the input data set and the ontological statements, and then evaluating the input query against the saturated dataset.
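The difference between the two approaches can be sketched on a toy example restricted to subclass statements. This is a simplified illustration in plain Python, not SIDES 3.0 code and not a SPARQL engine; the class names, the facts and all function names are hypothetical.

```python
# Hypothetical ontology: each class maps to its direct superclass.
SUBCLASS_OF = {"Cardiologist": "Physician", "Physician": "Person"}
# Hypothetical data: (subject, predicate, object) triples.
FACTS = {("alice", "type", "Cardiologist"), ("bob", "type", "Physician")}


def subclasses_of(cls):
    """Return cls together with all its direct and indirect subclasses."""
    result = {cls}
    changed = True
    while changed:
        changed = False
        for sub, sup in SUBCLASS_OF.items():
            if sup in result and sub not in result:
                result.add(sub)
                changed = True
    return result


def answer_by_rewriting(cls):
    """Rewrite '?x type cls' into a union over subclasses, then scan the raw facts."""
    union = subclasses_of(cls)
    return {s for (s, p, o) in FACTS if p == "type" and o in union}


def saturate(facts):
    """Add every type fact implied by the subclass statements (fixpoint)."""
    saturated = set(facts)
    changed = True
    while changed:
        changed = False
        for (s, p, o) in list(saturated):
            if p == "type" and o in SUBCLASS_OF:
                inferred = (s, "type", SUBCLASS_OF[o])
                if inferred not in saturated:
                    saturated.add(inferred)
                    changed = True
    return saturated


def answer_by_materialization(cls):
    """Evaluate the original query unchanged against the saturated facts."""
    return {s for (s, p, o) in saturate(FACTS) if p == "type" and o == cls}


print(answer_by_rewriting("Physician"), answer_by_materialization("Physician"))
```

Both functions return the same answers; rewriting pays the reasoning cost at query time, whereas materialization pays it once up front and stores the inferred facts.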
Most existing work on query rewriting has considered simple conjunctive queries, which correspond to so-called graph pattern queries in SPARQL. However, SPARQL is a powerful formal query language with many features and constructs that go beyond the expressivity of graph pattern queries.
The objective of this research project is to compare the query rewriting approach and the materialization approach in the setting of SIDES 3.0, in which some features of SPARQL 1.1 (in particular the GROUP BY and FILTER constructs) are extensively used to express queries.
To make the comparison possible, it will first be necessary to study the problem of query rewriting for aggregation queries, and to design query rewriting techniques for these queries.
Co-supervisors: Fabrice Jouanot (LIG, SLIDE), Olivier Palombi (LJK, Inria, Imagine) and Marie-Christine Rousset (LIG, SLIDE)
Internship: Assessing the trustworthiness of online identities
Project Description: The lack of strong identities, i.e., secure and verifiable identities that are backed by a certificate from a trusted authority (e.g., passport or social security number), has been a long-standing problem on the Web. While strong identities could provide better security for Web services, they failed to achieve mass adoption because they significantly raise the sign-on barrier for new users and raise privacy concerns, since users cannot be anonymous. Consequently, most Web services today provide weak identities that are not backed by any certificate. Unfortunately, weak identities are heavily exploited by malefactors to create multiple fake identities with ease. Such fake identities are traditionally known as Sybils and are typically used to inundate services with spam, spread fake news, or post fake reviews. Not only do these problems impact our daily activities, but they also deeply affect the economies and political life of our countries. The goal of this project is to investigate methods to assess the trustworthiness of online identities and content (e.g., check whether a particular restaurant review is fake or real, or whether an individual is interacting with a legitimate user or an attacker).
More precisely, the project consists in investigating whether we can quantify the trustworthiness of online identities in monetary terms using the price of identities in black markets. Popular black-market services like Freelancer and Taobao allow job postings promising the creation of fake identities with different levels of grooming. With the availability of such job posting data, the idea is very simple: given that the advertised black-market price for an identity groomed to a certain level is X, and that the expected utility that can be derived from the activities of this identity is Y, if X > Y we can expect that a rational attacker will have no incentive to use such an identity for malicious activities. The second step is to measure the extent to which we can increase the accountability of weak identities if we link the (potentially anonymous) identities of users across multiple systems.
Throughout the project the student will become familiar with the different ways online systems can be exploited by attackers and with possible countermeasures, and will learn to collect data from online services and analyze it.
Requirements: Strong coding skills. Experience in working with data is a plus.
This internship will be mentored by Oana Goga.
Internship: Audit the transparency mechanisms provided by social media advertising platforms
Project description: In recent years, targeted advertising has been the source of a growing number of privacy complaints from Internet users. At the heart of the problem lies the opacity of the targeted advertising mechanisms: users do not understand what data advertisers have about them and how this data is being used to select the ads they are being shown. This lack of transparency has begun to catch the attention of policy makers and government regulators, which are increasingly introducing laws requiring transparency.
To enhance transparency, Twitter recently introduced a feature (called “why am I seeing this”) that provides users with an explanation for why they have been targeted with a particular ad. While this is a positive step, it is important to check that these transparency mechanisms do not actually deceive users, doing more harm than good. The goal of the project is to verify whether the explanations provided by Twitter satisfy basic properties such as completeness, correctness and consistency. The student will need to develop a browser extension or a mobile application that collects the ad explanations received by real-world users on Twitter, and to conduct controlled ad campaigns for which we can collect the corresponding ad explanations.
Throughout the project the student will become familiar with the online targeted advertising ecosystem, learn to conduct online experiments and measure their impact, and reflect on what makes a good explanation.
Requirements: Strong coding skills. Experience in working with data is a plus.
This internship will be mentored by Oana Goga.
Internship: Build a platform for increasing the transparency of social media advertising
Project Description: In recent years, targeted advertising has been the source of a growing number of privacy complaints from Internet users. At the heart of the problem lies the opacity of the targeted advertising mechanisms: users do not understand what data advertisers have about them and how this data is being used to select the ads they are being shown. This lack of transparency has begun to catch the attention of policy makers and government regulators, which are increasingly introducing laws requiring transparency.
The project consists in building a tool that explains why a user has been targeted with a particular ad, without needing the collaboration of the advertising platform. The key idea is to crowdsource the transparency task to users: the tool is collaborative, and users donate data about the ads they receive in a privacy-preserving manner. The platform needs to aggregate data from multiple users (using machine learning) to infer (statistically) why a user received a particular ad. Intuitively, the tool will group together all users that received the same ad, and look at the most common demographics and interests of users in the group. The key challenge is to identify the limits of what we can statistically infer from such a platform.
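The grouping intuition above can be sketched as follows. This is a deliberately naive illustration with no privacy protection and no statistical testing; the report format, the attribute labels and the function name are hypothetical and are not part of the project.

```python
from collections import Counter, defaultdict


def explain_ads(reports, top_n=1):
    """For each ad, return the attributes most common among users who saw it.

    Each report is a dict with an "ad" identifier and the "attributes"
    (demographics, interests) of the user who donated it.
    """
    by_ad = defaultdict(list)
    for report in reports:
        by_ad[report["ad"]].append(report["attributes"])

    explanations = {}
    for ad, attribute_lists in by_ad.items():
        counts = Counter(attr for attrs in attribute_lists for attr in attrs)
        # Share of the ad's audience that carries each attribute.
        explanations[ad] = [
            (attr, count / len(attribute_lists))
            for attr, count in counts.most_common(top_n)
        ]
    return explanations


# Hypothetical donated reports.
reports = [
    {"ad": "shoes", "attributes": ("age:18-24", "interest:running")},
    {"ad": "shoes", "attributes": ("age:25-34", "interest:running")},
    {"ad": "shoes", "attributes": ("age:18-24", "interest:running")},
    {"ad": "car", "attributes": ("age:25-34", "interest:cars")},
]
explanations = explain_ads(reports)
print(explanations["shoes"])
```

An attribute shared by every viewer of an ad is only a candidate explanation: it may simply be common in the whole user base, which is one aspect of the inference limits mentioned above.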
Throughout the project the student will become familiar with the online targeted advertising ecosystem, and will learn to collect data from online services and apply machine learning and statistics concepts to real-world data.
Requirements: Strong coding skills and strong background in statistics and data mining.
This internship will be mentored by Oana Goga
Internship: Crowd Curation in Crowdsourcing
Crowdsourcing has been around for about 10 years. The ability of crowdsourcing platforms to “say anything” about their workforce is becoming a selling point. For instance, Prolific Academic claims that its workforce is best suited to survey tasks. In this internship, we are interested in curating workers on a crowdsourcing platform.
Workforce curation is the process of gathering information about workers. A major challenge here is the evolving nature of that information. For instance, workers’ skills change as they complete tasks, their motivation evolves, and their affinity with other workers varies. Therefore, workforce curation must be adaptive to account for the evolution of workers’ information. In addition, some of this information can be inferred, e.g., skills could be computed from previously completed tasks, and some of it is more accurate if acquired explicitly from workers, e.g., their motivation at a particular moment in time. Consequently, there is a need to interleave implicit and explicit workforce curation. The challenge here is to determine the moments at which workers are explicitly asked to provide feedback as well as the form in which they provide it.
– define a model for implicit workforce curation
– define a model for explicit workforce curation (at which granularity should we solicit feedback): options include asking workers to choose tasks they would like to complete, or to choose skills they would like to acquire
– define and implement a strategy that interleaves implicit and explicit curation: determine the moments when to switch from implicit to explicit and how that impacts workers’ involvement
– define the semantics of how to aggregate workers’ information into information on the crowd (classification vs regression)
– validation: deploy the workforce curation strategy and check several dimensions: is crowdwork quality higher with or without workforce curation? Is worker retention better? Is task throughput better? Other measures?
This internship will be mentored by Sihem Amer-Yahia <firstname.lastname@example.org> and Jean Michel Renders <email@example.com>, Naver Labs (formerly Xerox).
Strong programming skills are required.
Please read the following paper before coming to the interview:
Julien Pilourdault, Sihem Amer-Yahia, Dongwon Lee, Senjuti Basu Roy: Motivation-Aware Task Assignment in Crowdsourcing. EDBT 2017: 246-257
Internship: CrowdFair: A Tool to Support Non-Discrimination in Crowdsourcing
We address the issue of discrimination in crowdsourcing with the objective of developing a model and algorithms to test for discrimination in the assignment of tasks by requesters to workers. Algorithmic discrimination is a recent and growing topic in machine learning (bias in decision-making) and data mining (bias in trends). The input and output decision spaces are assumed to be metric spaces, so that discrimination can be calculated as distances.
Although several definitions of discrimination exist, the most widely used approach today is to partition the input data so that an algorithm does not discriminate between partitions. For example, a partition on gender would make it possible to say whether a decision algorithm (e.g., admission to a university) discriminates on the basis of gender by calculating the distance between the input data, in this case males and females, and the decisions, in this case admitted or not. This is referred to as distortion, and as distortion tolerance when different decisions for different partitions are compared.
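As a rough illustration of comparing decisions across partitions, the sketch below measures, for one partitioning attribute, the gap between the highest and lowest rates of positive decisions. Using this gap as the distance is an assumption made for the example and is not the definition used in CrowdFair; the records and function names are hypothetical.

```python
from collections import defaultdict


def positive_rates(records, attribute):
    """Rate of positive decisions (decision == 1) in each partition."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for record in records:
        group = record[attribute]
        totals[group] += 1
        positives[group] += record["decision"]
    return {group: positives[group] / totals[group] for group in totals}


def disparity(records, attribute):
    """Gap between the best-treated and worst-treated partitions (0 = no gap)."""
    rates = positive_rates(records, attribute)
    return max(rates.values()) - min(rates.values())


# Hypothetical admission decisions (1 = admitted, 0 = rejected).
records = [
    {"gender": "F", "decision": 1},
    {"gender": "F", "decision": 0},
    {"gender": "F", "decision": 0},
    {"gender": "F", "decision": 0},
    {"gender": "M", "decision": 1},
    {"gender": "M", "decision": 1},
    {"gender": "M", "decision": 1},
    {"gender": "M", "decision": 0},
]
print(positive_rates(records, "gender"), disparity(records, "gender"))
```

Other distances between partitions could be substituted for the rate gap without changing the structure of the computation.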
The objective of this internship is to develop CrowdFair, a tool to help requesters answer questions such as: for a set of participants and a set of possible task assignment functions, find the best partitioning of the participants for each function, i.e., the one offering the least discrimination for each function. We would then like to study this question when the set of workers evolves.
This internship will be mentored by Sihem Amer-Yahia <email@example.com> and Shady Elbassuoni from the American University of Beirut.
Internship: Ranked Temporal Joins on Streams
We study a particular kind of join, coined Ranked Temporal Join (RTJ), featuring predicates that compare time intervals and a scoring function associated with each predicate to quantify how well it is satisfied. RTJ queries are prevalent in a variety of applications such as network traffic monitoring, task scheduling, and tweet analysis.
Step 1. The candidate will be asked to run a distributed join algorithm for n-ary RTJ queries that has already been implemented.
Step 2. The candidate will design a streaming version of the join algorithm and implement it.
Step 3. The candidate will implement the designed algorithm in Storm/Spark/Flink.
Prerequisites: Strong implementation skills. Familiarity with Big Data infrastructures such as Spark or Flink is appreciated.
This internship will be co-advised by Sihem Amer-Yahia, DR CNRS@LIG and Vincent Leroy, Professor@UGA. It may lead to a PhD at SAP.
Work based on the following paper. The candidate must read it before coming to the interview:
Julien Pilourdault, Vincent Leroy, Sihem Amer-Yahia: Distributed Evaluation of Top-k Temporal Joins. SIGMOD Conference 2016: 1027-1039
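To give a feel for what a ranked temporal join computes, here is a small batch sketch of a binary join with a single scored predicate. It is neither the distributed algorithm of the cited paper nor a streaming version; the "meets" scoring function, its `tolerance` parameter and the sample intervals are assumptions made for the example.

```python
import heapq


def meets_score(a, b, tolerance=5.0):
    """Score in [0, 1] for 'a meets b': 1.0 when a ends exactly where b starts,
    decreasing linearly to 0 as the gap between the two grows to `tolerance`."""
    gap = abs(a[1] - b[0])
    return max(0.0, 1.0 - gap / tolerance)


def top_k_temporal_join(left, right, k, score=meets_score):
    """Nested-loop join over (start, end) intervals, keeping the k best pairs."""
    scored = ((score(a, b), a, b) for a in left for b in right)
    matches = (triple for triple in scored if triple[0] > 0)
    return heapq.nlargest(k, matches, key=lambda triple: triple[0])


# Hypothetical intervals given as (start, end) pairs.
left = [(0, 10), (3, 8)]
right = [(10, 20), (9, 15), (30, 40)]
best = top_k_temporal_join(left, right, k=2)
print(best)
```

The nested loop examines every pair, which is precisely what distributed and streaming evaluation strategies aim to avoid on large inputs.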
PhD Position in Fault Localization and Explanation for Concurrent Programs