
Upcoming/Ongoing positions

  • Thu 01 Feb 2018 – Sat 30 Jun 2018

    Internship: Expressing complex SPARQL queries in a restricted natural language controlled by a domain ontology: application to ontology-based student progress monitoring in SIDES 3.0.

    SPARQL is a powerful formal query language that has been designed to formalize (possibly complex) questions posed to semantic knowledge bases. The many SPARQL features and constructs make it difficult for end-users to express their information requests using the formal syntax of SPARQL queries. With the increasing amount of information stored in ontology-based (encyclopedic or domain-specific) knowledge bases, it is important to enable end-users not familiar with the SPARQL syntax to express their (possibly complex) questions using (possibly complex) sentences.

    Without any restriction on the sentences in natural language (e.g., in English), the translation task from natural language to SPARQL is very hard, if not impossible. Existing works [1,2,3] have explored the use of a controlled natural language and the restriction to certain types of questions to build user-friendly query interfaces.

    The objective of this research project is to extend the existing approaches in order to propose a restricted natural language with a good trade-off between the expressivity required for covering as many SPARQL 1.1 features as possible (in particular GROUP BY and FILTER constructs) and good guidance for suggesting the domain-specific vocabulary terms that make sense for the query under construction.
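
    As a concrete illustration of the kind of target query, the minimal sketch below runs a SPARQL 1.1 query combining FILTER and GROUP BY over a toy RDF graph with the rdflib library. The vocabulary (ex:answeredCorrectly, ex:hasDifficulty, ...) is made up for illustration and is not the actual SIDES 3.0 ontology.

        # A minimal sketch with a hypothetical vocabulary, using the rdflib library.
        from rdflib import Graph, Literal, Namespace
        from rdflib.namespace import RDF

        EX = Namespace("http://example.org/sides/")
        g = Graph()
        # Toy facts: students answering questions of a given difficulty.
        g.add((EX.alice, RDF.type, EX.Student))
        g.add((EX.bob, RDF.type, EX.Student))
        g.add((EX.alice, EX.answeredCorrectly, EX.q1))
        g.add((EX.alice, EX.answeredCorrectly, EX.q2))
        g.add((EX.bob, EX.answeredCorrectly, EX.q2))
        g.add((EX.q1, EX.hasDifficulty, Literal(3)))
        g.add((EX.q2, EX.hasDifficulty, Literal(5)))

        # "How many hard questions (difficulty >= 4) has each student answered correctly?"
        query = """
        PREFIX ex: <http://example.org/sides/>
        SELECT ?student (COUNT(?q) AS ?nbHard)
        WHERE {
          ?student a ex:Student .
          ?student ex:answeredCorrectly ?q .
          ?q ex:hasDifficulty ?d .
          FILTER (?d >= 4)
        }
        GROUP BY ?student
        """
        for row in g.query(query):
            print(row.student, row.nbHard)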

    The proposal will be evaluated in the setting of SIDES 3.0: an ontology-based e-learning platform for training and monitoring students in Medicine. The construction of SIDES 3.0 is a national project funded by ANR for 3 years. The target users are students and teachers of all the Medical Schools in France.

    A distinguishing point from existing works [1,2,3] is that we will have to take into account the specificities of the French language in designing the query interface in French. A first step will be to adapt to French the existing query interfaces in (restricted and controlled) English proposed in [1,2,3].

    A good level of French is thus required.

    [1] E. Kaufmann and A. Bernstein. Evaluating the usability of natural language query languages and interfaces to semantic web knowledge bases. J. Web Semantics, 8(4):377–393, 2010.
    [2] S. Ferré. Expressive and scalable query-based faceted search over SPARQL endpoints. In P. Mika and T. Tudorache, editors, Int. Semantic Web Conf. Springer, 2014.
    [3] S. Ferré. SPARKLIS: An Expressive Query Builder for SPARQL Endpoints with Guidance in Natural Language. Semantic Web Journal, vol. 8, no. 3, pp. 405-418, 2017.

    Co-supervisors: Fabrice Jouanot (LIG, SLIDE), Olivier Palombi (LJK, Inria, Imagine) and Marie-Christine Rousset (LIG, SLIDE)

  • Thu 01 Feb 2018 – Sat 30 Jun 2018

    Internship: Ontology-based Query Answering: comparison between a Query Rewriting approach and a Materialization approach in SIDES 3.0 setting.

    SIDES 3.0 is an ontology-based e-learning platform for training and monitoring students in Medicine. It is under construction as a national project funded by ANR for 3 years.

    The SIDES 3.0 knowledge base is composed of a lightweight domain ontology made of RDFS statements as well as some domain-specific rules, and a large amount of RDF facts on instances of the classes and properties of the ontology.

    Query answering through ontologies is more complex than for classical databases because answers to queries may have to be inferred from the data and the ontological statements, thus requiring reasoning techniques to be employed.

    Two approaches have been designed for ontology-based query answering:

    • Query rewriting exploits the ontological statements to reformulate the input query into a union of queries whose evaluation can then be done directly against the data to obtain the set of answers to the input query.

    • Materialization exploits the ontological statements as rules in order to saturate the data, i.e., to infer all the facts that can be derived from the input dataset and the ontological statements; the input query is then evaluated against the saturated dataset (a minimal sketch contrasting both approaches follows this list).
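
    To make the contrast concrete, here is a minimal, self-contained sketch (toy triples and a single rdfs:subClassOf hierarchy, not the SIDES 3.0 data) of the two approaches for the query "find all instances of :Person".

        # Toy knowledge base: two RDFS subclass axioms and a few instance facts.
        subclass_of = {":Student": ":Person", ":Teacher": ":Person"}
        facts = {(":alice", "rdf:type", ":Student"),
                 (":bob", "rdf:type", ":Teacher"),
                 (":carol", "rdf:type", ":Person")}

        # Materialization: saturate the data with every derivable rdf:type fact,
        # then evaluate the query directly on the saturated dataset.
        def materialize(facts, subclass_of):
            saturated = set(facts)
            changed = True
            while changed:
                changed = False
                for (s, p, c) in list(saturated):
                    if p == "rdf:type" and c in subclass_of:
                        new = (s, "rdf:type", subclass_of[c])
                        if new not in saturated:
                            saturated.add(new)
                            changed = True
            return saturated

        # Query rewriting: reformulate "instances of :Person" into a union of queries
        # (one per subclass) evaluated against the original, unsaturated data.
        def rewrite_and_answer(facts, subclass_of, target=":Person"):
            classes = {target} | {c for c, sup in subclass_of.items() if sup == target}
            return {s for (s, p, c) in facts if p == "rdf:type" and c in classes}

        print({s for (s, p, c) in materialize(facts, subclass_of)
               if p == "rdf:type" and c == ":Person"})   # {':alice', ':bob', ':carol'}
        print(rewrite_and_answer(facts, subclass_of))    # same set of answers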

    Most existing works on query rewriting have considered simple conjunctive queries, which correspond to so-called graph pattern queries in SPARQL. However, SPARQL is a powerful formal query language with many features and constructs that go beyond the expressivity of graph pattern queries.

    The objective of this research project is to compare the query rewriting approach and the materialization approach in the setting of SIDES 3.0 in which some features of SPARQL 1.1 (in particular GROUP-BY and FILTER constructs) are extensively used to express queries.

    To make the comparison possible, it will first be necessary to study the problem of query rewriting for aggregation queries, and to design query rewriting techniques for these queries.

    Co-supervisors: Fabrice Jouanot (LIG, SLIDE), Olivier Palombi (LJK, Inria, Imagine) and Marie-Christine Rousset (LIG, SLIDE)


  • Thu 01 Feb 2018 – Sat 30 Jun 2018

    Internship: Assessing the trustworthiness of online identities

    Project Description: The lack of strong identities, i.e., secure and verifiable identities that are backed by a certificate from a trusted authority (e.g., a passport or social security number), has been a long-standing problem on the Web. While strong identities could provide better security for Web services, they have failed to achieve mass adoption because they significantly raise the sign-on barrier for new users and raise privacy concerns – users cannot be anonymous. Consequently, most Web services today provide weak identities that are not backed by any certificate. Unfortunately, weak identities are heavily exploited by malefactors to create multiple fake identities with ease. Such fake identities are traditionally known as Sybils [1] and are typically used to inundate services with spam, spread fake news, or post fake reviews. Not only do these problems impact our daily activities, but they also deeply affect the economies and political life of our countries. The goal of this project is to investigate methods to assess the trustworthiness of online identities and content (e.g., check whether a particular restaurant review is fake or real, or whether an individual is interacting with a legitimate user or an attacker).

    More precisely, the project consists in investigating whether we can quantify the trustworthiness of online identities in monetary terms using the price of identities in black markets [2]. Popular black-market services like Freelancer and Taobao [6] allow job postings promising the creation of fake identities with different levels of grooming. With the availability of such job posting data, the idea is very simple: given that the advertised black-market price for an identity groomed to a certain level is X, and that the expected utility that can be derived from the activities of this identity is Y, if X > Y, we can expect that a rational attacker will have no incentive to use such an identity for malicious activities. The second step is to measure the extent to which we can increase the accountability of weak identities if we link the (potentially anonymous) identities of users across multiple systems.
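
    The cost/utility argument above amounts to a simple decision rule. The sketch below applies it to made-up numbers; the grooming levels, prices, and utility estimates are purely illustrative, not measurements from any actual black market.

        # Hypothetical data: black-market price to obtain an identity groomed to a given
        # level, and the estimated utility an attacker can extract from it.
        identities = [
            {"grooming": "fresh account",     "price_usd": 0.5, "expected_utility_usd": 2.0},
            {"grooming": "aged, 100 friends", "price_usd": 5.0, "expected_utility_usd": 4.0},
            {"grooming": "phone-verified",    "price_usd": 9.0, "expected_utility_usd": 6.5},
        ]

        for ident in identities:
            # A rational attacker only buys when the expected gain exceeds the price (Y > X).
            profitable = ident["expected_utility_usd"] > ident["price_usd"]
            print(f'{ident["grooming"]}: attractive to a rational attacker? {profitable}')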

    Throughout the project the student will be able to familiarize themselves with the different ways online systems can be exploited by attackers and with possible countermeasures, learn to collect data from online services, and analyze this data.

    Requirements: Strong coding skills. Experience in working with data is a plus.

    More details: https://people.mpi-sws.org/~ogoga/offers/Internship_2017_trust.pdf

    This internship will be mentored by Oana Goga.

  • Thu 01 Feb 2018 – Sat 30 Jun 2018

    Internship: Audit the transparency mechanisms provided by social media advertising platforms

    Project description: In recent years, targeted advertising has been the source of a growing number of privacy complaints from Internet users [1]. At the heart of the problem lies the opacity of the targeted advertising mechanisms: users do not understand what data advertisers have about them and how this data is being used to select the ads they are being shown. This lack of transparency has begun to catch the attention of policy makers and government regulators, who are increasingly introducing laws requiring transparency [2].

    To enhance transparency, Twitter recently introduced a feature (called “why am I seeing this”) that provides users with an explanation for why they have been targeted with a particular ad. While this is a positive step, it is important to check that these transparency mechanisms do not actually deceive users, leading to more harm than good. The goal of the project is to verify whether the explanations provided by Twitter satisfy basic properties such as completeness, correctness, and consistency. The student will need to develop a browser extension or a mobile application that is able to collect the ad explanations received by real-world users on Twitter, and to conduct controlled ad campaigns so that we can collect the corresponding ad explanations.
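
    One way to operationalize these properties is to compare, for a controlled campaign, the targeting attributes we actually used with the attributes mentioned in the explanation shown to the user. The sketch below assumes a simple set-based format for both sides; it illustrates the checks and is not Twitter's actual data model.

        # Attributes used when running the controlled campaign (ground truth, known to us).
        used_attributes = {"age:25-34", "interest:cycling", "location:Grenoble"}

        # Attributes mentioned in the "why am I seeing this" explanation collected for that ad.
        explained_attributes = {"interest:cycling"}

        def audit_explanation(used, explained):
            return {
                # Completeness: every attribute that was used is disclosed.
                "complete": used <= explained,
                # Correctness: nothing is disclosed that was not actually used.
                "correct": explained <= used,
                # Which used attributes the explanation hides.
                "hidden": used - explained,
            }

        print(audit_explanation(used_attributes, explained_attributes))
        # -> complete: False, correct: True, hidden: the two undisclosed attributes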

    Throughout the project the student will be able to familiarize themselves with the online targeted advertising ecosystem, learn to conduct online experiments and measure their impact, and reflect on what makes a good explanation.

    Requirements: Strong coding skills. Experience in working with data is a plus.

    More details: https://people.mpi-sws.org/~ogoga/offers/Internship_2017_Twitter_explanations.pdf

    This internship will be mentored by Oana Goga.

  • Thu 01 Feb 2018 – Sat 30 Jun 2018

    Internship: Build a platform for increasing the transparency of social media advertising

    Project Description: In recent years, targeted advertising has been the source of a growing number of privacy complaints from Internet users [1]. At the heart of the problem lies the opacity of the targeted advertising mechanisms: users do not understand what data advertisers have about them and how this data is being used to select the ads they are being shown. This lack of transparency has begun to catch the attention of policy makers and government regulators, who are increasingly introducing laws requiring transparency [2].

    The project consists in building a tool that provides explanations for why a user has been targeted with a particular ad, without requiring the collaboration of the advertising platform. The key idea is to crowdsource the transparency task to users: the tool is collaborative, and users donate, in a privacy-preserving manner, data about the ads they receive. The platform needs to aggregate data from multiple users (using machine learning) to infer (statistically) why a user received a particular ad. Intuitively, the tool will group together all users that received the same ad, and look at the most common demographics and interests of users in the group. The key challenge is to identify the limits of what we can statistically infer from such a platform.
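
    The grouping-and-aggregation idea can be sketched very simply: collect, for each ad, the users who reported it, then look at which attributes are over-represented in that group. The records below are invented for illustration; a real deployment would also need the privacy-preserving collection and the statistical analysis discussed above.

        from collections import Counter

        # Donated records: which user saw which ad, and that user's declared attributes.
        reports = [
            {"user": "u1", "ad": "ad42", "attributes": {"age:25-34", "interest:running"}},
            {"user": "u2", "ad": "ad42", "attributes": {"age:25-34", "interest:cooking"}},
            {"user": "u3", "ad": "ad42", "attributes": {"age:25-34", "interest:running"}},
            {"user": "u4", "ad": "ad7",  "attributes": {"age:55+",   "interest:gardening"}},
        ]

        def likely_targeting(reports, ad_id, min_share=0.6):
            group = [r for r in reports if r["ad"] == ad_id]
            counts = Counter(attr for r in group for attr in r["attributes"])
            # Keep attributes shared by at least `min_share` of the users who saw the ad.
            return {attr for attr, c in counts.items() if c / len(group) >= min_share}

        print(likely_targeting(reports, "ad42"))  # e.g. {'age:25-34', 'interest:running'}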

    Throughout the project the student will be able to familiarize themselves with the online targeted advertising ecosystem, learn to collect data from online services, and apply machine learning and statistics concepts to real-world data.

    Requirements: Strong coding skills and strong background in statistics and data mining.

    More details: https://people.mpi-sws.org/~ogoga/offers/Internship_2017_crowdsouring.pdf

    This internship will be mentored by Oana Goga.

  • Thu 01 Feb 2018 – Sat 30 Jun 2018

    Internship: An Optimization Framework for Text Summarization

    The social Web is populated with user-generated reviews for restaurants/movies/other points of interest.  Yet, it is difficult for Web users to extract meaningful content from those reviews.  We propose to generate review summaries of movies, restaurants, and Points Of Interest (POIs).

    Wouldn’t it be wonderful to have a system that tells you in a few words what most people think about a movie, a restaurant, or a museum? Wouldn’t it be even greater if such a system could do it in a language you understand even if the reviews are written in different languages (when you’re traveling for example)?

    While a large number of automatic text translation and summarization tools exist, their results remain poor on user-generated content. In fact, humans are best at summarizing text, but they can only do it for a small number of reviews, for obvious reasons. In this project, we want to develop a hybrid approach to generate meaningful textual summaries from multilingual reviews. We want to achieve both the scalability of automated approaches and the coherency of manually-crafted summaries by exploring and testing different strategies that combine algorithms with crowdsourcing.

    The steps are:
    – Design different deployment strategies based on the framework developed in [1]. Each strategy is akin to a plan (as in query optimization) and is characterized by its expected cost, outcome quality, and latency. Plans will differ in the degree to which they involve human workers. More specifically, they differ in their work structure, their workforce organization, and in their target crowdsourcing platforms.
    – Design an optimization approach that generates the best plan according to some criteria (a minimal plan-selection sketch follows this list).
    – Implement and test the effectiveness of different strategies for multilingual text summarization and provide guidelines for the deployment of text summarization tasks.
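
    To fix ideas, here is a minimal plan-selection sketch. The plans and their cost/quality/latency numbers are invented; a real optimizer would estimate them from the framework developed in [1].

        from dataclasses import dataclass

        @dataclass
        class Plan:
            name: str          # e.g. "fully automatic", "crowd post-editing", "crowd-only"
            cost: float        # expected monetary cost (e.g. crowd payments)
            quality: float     # expected summary quality in [0, 1]
            latency: float     # expected completion time in hours

        plans = [
            Plan("fully automatic",    cost=0.0, quality=0.55, latency=0.1),
            Plan("crowd post-editing", cost=2.0, quality=0.80, latency=6.0),
            Plan("crowd-only",         cost=8.0, quality=0.90, latency=24.0),
        ]

        def best_plan(plans, budget, max_latency):
            # Among plans meeting the budget and latency constraints, pick the highest quality.
            feasible = [p for p in plans if p.cost <= budget and p.latency <= max_latency]
            return max(feasible, key=lambda p: p.quality) if feasible else None

        print(best_plan(plans, budget=5.0, max_latency=12.0))  # -> the post-editing plan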

    Prospective candidates should be good at writing code. They must read the following paper before coming to the interview.

    [1] Ria Mae Borromeo, Maha Alsaysneh, Sihem Amer-Yahia, Vincent Leroy: Crowdsourcing Strategies for Text Creation Tasks. EDBT 2017: 450-453

    This internship will be mentored by Sihem Amer-Yahia <sihem.amer-yahia@imag.fr>

  • Thu 01 Feb 2018 – Sat 30 Jun 2018

    Internship: Crowd Curation in Crowdsourcing

    Crowdsourcing has been around for about 10 years. The ability of crowdsourcing platforms to characterize their workforce is becoming a selling point. For instance, Prolific Academic claims that its workforce is best suited for deploying survey tasks. In this internship, we are interested in curating workers on a crowdsourcing platform.

    Workforce curation is the process of gathering information about workers. A major challenge here is the evolving nature of that information. For instance, workers' skills change as they complete tasks, their motivation evolves, and their affinity with other workers varies. Therefore, workforce curation must be adaptive to account for the evolution of workers' information. In addition, some of this information can be inferred, e.g., skills could be computed from previously completed tasks, while some of it is more accurate if acquired explicitly from workers, e.g., their motivation at a particular moment in time. Consequently, there is a need to interleave implicit and explicit workforce curation. The challenge here is to determine the moments at which workers are explicitly asked to provide feedback as well as the form under which they provide it.
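
    One simple way to think about the interleaving is as a policy that relies on implicit signals by default and asks the worker explicitly only when the implicit estimate is too scarce or too stale. The sketch below illustrates that policy; the skill model and thresholds are placeholders, not the model to be designed in the internship.

        import time

        class WorkerProfile:
            def __init__(self, worker_id):
                self.worker_id = worker_id
                self.skill_estimate = 0.5       # implicit estimate, in [0, 1]
                self.observations = 0
                self.last_explicit_ask = 0.0    # timestamp of the last explicit question

            def update_implicit(self, task_quality):
                # Implicit curation: refine the skill estimate from completed-task quality.
                self.observations += 1
                self.skill_estimate += (task_quality - self.skill_estimate) / self.observations

            def should_ask_explicitly(self, now, min_observations=5, max_staleness=7 * 86400):
                # Explicit curation is triggered when implicit evidence is scarce or stale.
                return (self.observations < min_observations
                        or now - self.last_explicit_ask > max_staleness)

        w = WorkerProfile("w17")
        w.update_implicit(0.8)
        print(w.should_ask_explicitly(now=time.time()))  # True: too few observations so far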

    Tasks:
    – define a model for implicit workforce curation
    – define a model for explicit workforce curation (at which granularity should we solicit feedback): options include asking workers to choose tasks they would like to complete, or to choose skills they would like to acquire
    – define and implement a strategy that interleaves implicit and explicit curation: determine the moments when to switch from implicit to explicit and how that impacts workers’ involvement
    – define the semantics of how to aggregate workers’ information into information on the crowd (classification vs regression)
    – validation: deploy the workforce curation strategy and check several dimensions: is crowdwork quality higher with or without workforce curation? is worker retention better? is task throughput better? Other measures?

    This internship will be mentored by Sihem Amer-Yahia <sihem.amer-yahia@imag.fr> and Jean Michel Renders <jean-michel.renders@naverlabs.com>, Naver Labs (formerly Xerox).

    Strong programming skills are required.

    Please read the following paper before coming to the interview:
    Julien Pilourdault, Sihem Amer-Yahia, Dongwon Lee, Senjuti Basu Roy: Motivation-Aware Task Assignment in Crowdsourcing. EDBT 2017: 246-257

  • Thu 01 Feb 2018 – Sat 30 Jun 2018

    Internship: CrowdFair: A Tool to Support Non-Discrimination in Crowdsourcing

    We address the issue of discrimination in crowdsourcing with the objective of developing a model and algorithms to test discrimination in the assignment of tasks by requesters to workers. Algorithmic discrimination is a recent topic that is growing in machine learning (bias in decision-making) and data mining (bias in trends) [1]. The input and output decision spaces are assumed to be metric spaces so that discrimination can be calculated as distances.

    Although several definitions of discrimination exist, the most widely used approach today is to partition the input data so that an algorithm does not discriminate between partitions [2][3][4][5][6][7]. For example, a partition on gender would make it possible to say whether a decision algorithm (e.g., admission to a university) discriminates on the basis of gender by calculating the distance between the input data, in this case males and females, and the decisions, in this case admitted or not. This is referred to as distortion, and distortion tolerance when different decisions for different partitions are compared.
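
    As a very small illustration of measuring discrimination between partitions, the sketch below compares the decision rates of two partitions of the input population and reports their absolute difference. This is only one simple distance-based instantiation of the idea above, not necessarily the measure CrowdFair will adopt.

        def decision_rate(decisions):
            # decisions: list of booleans (assigned/admitted or not) for one partition.
            return sum(decisions) / len(decisions)

        def distortion(partition_a, partition_b):
            # Distance between the decisions obtained for the two partitions.
            return abs(decision_rate(partition_a) - decision_rate(partition_b))

        # Toy example: decisions of a task-assignment function for two gender partitions.
        female = [True, True, False, True, False]   # 60% assigned
        male   = [True, True, True, True, False]    # 80% assigned

        d = distortion(female, male)
        tolerance = 0.1
        print(f"distortion = {d:.2f}, within tolerance: {d <= tolerance}")  # 0.20, False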

    The objective of this internship is to develop CrowdFair, a tool to help requesters answer questions such as: for a set of participants and a set of possible task assignment functions, find the best partitioning of the participants for each function, i.e., the one offering the least discrimination for each function. We would then like to study this question when the set of workers evolves.

    This internship will be mentored by Sihem Amer-Yahia <sihem.amer-yahia@imag.fr> and Shady Elbassuoni from the American University of Beirut.

  • Thu 01 Feb 2018 – Sat 30 Jun 2018

    Internship: Data Pipelines for Large-Scale User Data Analytics

    User data is characterized by a combination of demographics, such as age and occupation, and user actions, such as rating a movie, reviewing a restaurant, or buying groceries. The ability to analyze user data enables data scientists to conduct large-scale population studies and gain insights into various population segments. As a result, several approaches have been developed to find informative user segments in user data. However, there does not exist an end-to-end solution that offers ease of use and high performance for large-scale user data mining. Ease of use is of particular importance here as several approaches need to be tested to mine user segments.

    The general goal of this research project is to design a productive and effective programming framework for the development of data pipelines to analyze user data and extract useful user segments. The approach followed is to propose a separation between logical and physical operators to model the analysis of user data. Logical operators capture fundamental operations required for data preparation and mining, whereas physical operators provide alternative implementations of the logical operators that achieve different levels of effectiveness defined by the quality of resulting user segments [2]. Optimization methods can then be devised to improve the overall response time and effectiveness of a data pipeline for the analysis of user data. We note that a similar approach has been taken by systems like SystemML and KeystoneML for the development of machine learning data pipelines with a particular focus on response time. For instance, in [1], logical operators are tailored to the training and application of models whereas optimization techniques perform both per-operator optimization and end-to-end pipeline optimization using a cost-based optimizer that accounts for both computation and communication costs. By contrast, our optimization goal is the effectiveness of the data pipeline without compromising response time.

    The focus on effectiveness requires adopting a data-driven approach in which different semantics of the same pipeline can be evaluated with various physical operators. The quality of user segments relies on optimizing one or several dimensions, ranging from coverage to uniformity and diversity. For example, the task of finding user segments with uniform ratings can be implemented in a physical operator that solves an optimization problem. Such a problem is formulated as finding the K most uniform segments whose coverage of the input data exceeds a threshold [3]. Alternatively, it could be formulated as finding the K most uniform and most diverse user segments [4]. In this internship, we will specifically explore the combination of such physical operators with pipeline-level optimizations in order to extract high-quality segments in reasonable time.
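
    The "K most uniform segments under a coverage constraint" formulation can be illustrated with a small sketch: score each candidate segment by the variance of its ratings, discard segments that cover too few input records, and keep the K best. The candidate segments and ratings below are invented, and this exhaustive scan is only one possible physical operator for this logical task.

        from statistics import pvariance

        # Candidate segments: a description and the ratings of the users they cover.
        segments = {
            "age:18-25 & genre:comedy": [4, 4, 5, 4, 4, 5],
            "age:18-25 & genre:drama":  [1, 5, 2, 5],
            "occupation:student":       [3, 3, 3, 3, 4],
            "location:Grenoble":        [2, 5],
        }

        def top_k_uniform(segments, k, min_coverage, total_records):
            feasible = {desc: ratings for desc, ratings in segments.items()
                        if len(ratings) / total_records >= min_coverage}
            # Lower rating variance = more uniform segment.
            ranked = sorted(feasible, key=lambda desc: pvariance(feasible[desc]))
            return ranked[:k]

        print(top_k_uniform(segments, k=2, min_coverage=0.25, total_records=17))
        # e.g. ['occupation:student', 'age:18-25 & genre:comedy']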

    The program of this student project covers the following steps:

    1. Identify common tasks required for the processing and mining of user data, starting from a review of [2-5].
    2. Design primitive and generic logical operators to express data pipelines for data preparation and mining. The design will also consider data preparation, cleaning, and enrichment already available in the SAP Data Hub platform.
    3. Develop per-operator (e.g., choice of optimization dimension) and pipeline-level (e.g., operator reordering) optimizations with the specific goal of extracting user segments of varying quality.
    4. Propose and realize different implementations of proposed pipelines for testing purposes. Their usage will be demonstrated on the implementation of two specific pipelines, namely finding segments with uniform ratings or segments with polarized ratings as described in [1].

    Prerequisites: Strong implementation skills. Familiarity with Big Data infrastructures such as Spark or Flink is appreciated.

    This internship will be co-advised by Sihem Amer-Yahia, DR CNRS@LIG and Eric Simon, Chief Architect@SAP.

    References

    [1] Evan R. Sparks, Shivaram Venkataraman, Tomer Kaftan, Michael J. Franklin, Benjamin Recht: KeystoneML: Optimizing Pipelines for Large-Scale Advanced Analytics. ICDE 2017: 535-546
    [2] Sihem Amer-Yahia, Laks V. S. Lakshmanan, Cong Yu: SocialScope: Enabling Information Discovery on Social Content Sites. CIDR 2009
    [3] M. Das, S. Amer-Yahia, G. Das, and C. Yu. MRI: meaningful interpretations of collaborative ratings. PVLDB, 4(11):1063-1074, 2011.
    [4] Sihem Amer-Yahia, Sofia Kleisarchaki, Naresh Kumar Kolloju, Laks V. S. Lakshmanan, Ruben H. Zamar: Exploring Rated Datasets with Rating Maps. WWW 2017: 1411-1419
    [5] Behrooz Omidvar-Tehrani, Sihem Amer-Yahia, Pierre-François Dutot, Denis Trystram: Multi-Objective Group Discovery on the Social Web. ECML/PKDD (1) 2016: 296-312

  • Thu 01 Feb 2018 – Sat 30 Jun 2018

    Internship: Interactive Exploration of Health Trajectories in Viz4LIFE

    In the context of the CDP LIFE project, we collaborate with doctors of the CHU of Grenoble, specialists in respiratory diseases. These physicians would like to follow the health trajectories of patients with various respiratory diseases such as sleep apnea, asthma, bronchiolitis and liver cancer. The traditional method is the constitution of (physical!) cohorts and their follow-up over long or short periods of time. This method, although effective, is expensive, and observations on one cohort cannot be generalized to others. What our CHU partners would like to have is a tool allowing them to build and analyze virtual rather than physical cohorts, i.e., an interface allowing them to build cohorts over time, to see how their health has evolved over a period of their choice, and to explore different explanations for this evolution.

    The objective of this work is to develop Viz4LIFE, a tool for visualizing health trajectories and exploring several possible explanations for them. For this purpose, we have external data sources: pollution data, tweets, and receipt data to infer dietary habits by geographical region. We will therefore initially focus on following geographical cohorts (e.g., patients living in the center of Grenoble, patients at 800m altitude, patients living south of Grenoble) and develop a visual interface.
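
    A first, very reduced version of the cohort-building step could look like the sketch below: group patient records into geographical cohorts and follow a health indicator over time. The records, regions, and indicator are invented placeholders; the real tool would add the external data sources (pollution, tweets, receipts) and a visual interface on top.

        import pandas as pd

        records = pd.DataFrame([
            {"patient": "p1", "region": "Grenoble-center", "month": "2018-01", "apnea_events": 12},
            {"patient": "p2", "region": "Grenoble-center", "month": "2018-01", "apnea_events": 20},
            {"patient": "p1", "region": "Grenoble-center", "month": "2018-02", "apnea_events": 10},
            {"patient": "p3", "region": "altitude-800m",   "month": "2018-01", "apnea_events": 7},
            {"patient": "p3", "region": "altitude-800m",   "month": "2018-02", "apnea_events": 6},
        ])

        # Health trajectory of each geographical cohort: average indicator per region and month.
        trajectories = (records
                        .groupby(["region", "month"])["apnea_events"]
                        .mean()
                        .reset_index())
        print(trajectories)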

    Recommended readings:

    – Zhang, Haopeng, Yanlei Diao, and Alexandra Meliou. “EXstream: Explaining Anomalies in Event Stream Monitoring.” In EDBT, pp. 156-167. 2017.
      http://www.lix.polytechnique.fr/~yanlei.diao/publications/xstream-edbt2017.pdf
    – El Gebaly, Kareem, Parag Agrawal, Lukasz Golab, Flip Korn, and Divesh Srivastava. “Interpretable and informative explanations of outcomes.” Proceedings of the VLDB Endowment 8, no. 1 (2014): 61-72.
    – Meliou, Alexandra, Sudeepa Roy, and Dan Suciu. “Causality and explanations in databases.” Proceedings of the VLDB Endowment 7, no. 13 (2014): 1715-1716.
      http://www.vldb.org/pvldb/vol7/p1715-meliou.pdf
    – Hlaváčková-Schindler, Katerina, Milan Paluš, Martin Vejmelka, and Joydeep Bhattacharya. “Causality detection based on information-theoretic approaches in time series analysis.” Physics Reports 441, no. 1 (2007): 1-46.

    This internship will be mentored by Sihem Amer-Yahia <sihem.amer-yahia@imag.fr>

  • Thu 01 Feb 2018 – Sat 30 Jun 2018

    Internship: Ranked Temporal Joins on Streams

    We study a particular kind of join, coined Ranked Temporal Join (RTJ), featuring predicates that compare time intervals and a scoring function associated with each predicate to quantify how well it is satisfied. RTJ queries are prevalent in a variety of applications such as network traffic monitoring, task scheduling, and tweet analysis.
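
    As a warm-up illustration of what an RTJ computes, the sketch below joins two small collections of time-stamped items on an "overlaps" predicate, scores each pair by the length of the overlap, and returns the top-k pairs. The data and the scoring function are illustrative only; the actual predicates and scores are those studied in [1].

        import heapq

        # Each item: (id, start, end) with times in seconds.
        left  = [("flow1", 0, 50), ("flow2", 40, 90)]
        right = [("task1", 30, 60), ("task2", 100, 120)]

        def overlap_score(a, b):
            # Score of the "overlaps" predicate: length of the common interval (0 if disjoint).
            _, s1, e1 = a
            _, s2, e2 = b
            return max(0, min(e1, e2) - max(s1, s2))

        def ranked_temporal_join(left, right, k):
            pairs = ((overlap_score(a, b), a[0], b[0]) for a in left for b in right)
            # Keep only satisfied pairs (positive score) and return the k best ones.
            return heapq.nlargest(k, (p for p in pairs if p[0] > 0))

        print(ranked_temporal_join(left, right, k=2))
        # [(20, 'flow2', 'task1'), (20, 'flow1', 'task1')]
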
    Step 1. The candidate will be asked to run a distributed join algorithm for n-ary RTJ queries that was implemented in [1].
    Step 2. The candidate will design a streaming version of the join algorithm and implement it.
    Step 3. The candidate will implement the designed algorithm in Storm/Spark/Flink.

    Prerequisites: Strong implementation skills. Familiarity with Big Data infrastructures such as Spark or Flink is appreciated.

    This internship will be co-advised by Sihem Amer-Yahia, DR CNRS@LIG and Vincent Leroy, Professor@UGA. It may lead to a PhD at SAP.

    Work based on the following paper. The candidate must read it before coming to the interview:
    [1] Julien Pilourdault, Vincent Leroy, Sihem Amer-Yahia: Distributed Evaluation of Top-k Temporal Joins. SIGMOD Conference 2016: 1027-1039

  • Thu 01 Feb 2018 – Sat 30 Jun 2018

    Internship: Mining Knowledge Graphs

    Knowledge bases, such as YAGO and the Google Knowledge Graph, contain hundreds of millions of facts (edges) about millions of entities (vertices). It is possible to find patterns (regularities) in these graphs to infer new knowledge. For instance, we can learn that if a person P works at W, and W is located in the city C, then P is likely to live in the same region as C. These graph-based association rules produce new facts in the database, and are thus valuable.
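
    The example rule can be written down as a simple join over the fact triples, in the spirit of the SQL-join formulation mentioned below. The triples and predicate names are invented for illustration.

        # Toy knowledge graph as (subject, predicate, object) triples.
        facts = [
            ("anna",     "worksAt",   "acme"),
            ("acme",     "locatedIn", "grenoble"),
            ("grenoble", "inRegion",  "auvergne-rhone-alpes"),
        ]

        def apply_rule(facts):
            # Rule: worksAt(P, W) AND locatedIn(W, C) AND inRegion(C, R) => livesInRegion(P, R)
            works   = {(s, o) for s, p, o in facts if p == "worksAt"}
            located = {(s, o) for s, p, o in facts if p == "locatedIn"}
            region  = {(s, o) for s, p, o in facts if p == "inRegion"}
            return {(p, "livesInRegion", r)
                    for p, w in works
                    for w2, c in located if w == w2
                    for c2, r in region if c == c2}

        print(apply_rule(facts))  # {('anna', 'livesInRegion', 'auvergne-rhone-alpes')}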

    The state of the art in knowledge graph mining is based on SQL join queries that can be parallelized with Spark.

    1. Yang Chen, Sean Goldberg, Daisy Zhe Wang, and Soumitra Siddharth Johri. 2016. Ontological Pathfinding. In Proceedings of the 2016 International Conference on Management of Data (SIGMOD ’16). ACM, 835–846.
    2. Luis Galárraga, Christina Teflioudi, Katja Hose, and Fabian M. Suchanek. 2015. Fast Rule Mining in Ontological Knowledge Bases with AMIE+. The VLDB Journal 24, 6 (Dec. 2015), 707–730.

    However, in the field of pattern mining, specific frequent graph mining algorithms have been developed to efficiently find repeated sub-graphs in a larger structure.
    1. Ehab Abdelhamid, Ibrahim Abdelaziz, Panos Kalnis, Zuhair Khayyat, and Fuad Jamour. 2016. Scalemine: Scalable Parallel Frequent Subgraph Mining in a Single Large Graph. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC ’16).
    2. Carlos H. C. Teixeira, Alexandre J. Fonseca, Marco Serafini, Georgos Siganos, Mohammed J. Zaki, and Ashraf Aboulnaga. 2015. Arabesque: A System for Distributed Graph Mining. In Proceedings of the 25th Symposium on Operating Systems Principles (SOSP ’15). ACM, 425–440.

    The goal of this internship is to integrate state of the art graph mining techniques in the context of knowledge graph mining. The graph mining approach used has been recently developed by the SLIDE team. The work will consist in specializing this existing algorithm for the context of knowledge graphs, and evaluating its performance.
    This internship will be advised by Vincent Leroy.