SLIDE Research Topics
- Ethics and Privacy. This axis refers to models, methods and algorithms for data anonymization, privacy and security of social systems, algorithmic fairness, and transparency.
- Large-Scale Data Analytics. This axis refers to a collection of approaches and algorithms used to extract value from data, with a new emphasis on building full-fledged systems with guided multi-modal analytics.
- Information Exploration. This axis refers to a collection of scalable approaches for exploring information, with a new emphasis on interactive data exploration, ontology-based reasoning on data, and data visualization.
Our research is applied to a variety of datasets, ranging from personal data to execution traces, Semantic Web data, and data from the medical and education domains. Many of our applications are evaluated using traditional information retrieval and learning techniques, as well as crowdsourcing.
Ethics and Privacy
The overarching goal of this research axis is to provide methods and tools for a more dependable, user-centric future Internet, in which users are comfortable sharing data because they are assured that their privacy is protected and their data is handled correctly, and because they trust the services provided by such systems.
Privacy
We aim to provide tools and methods that allow users and organizations to measure and control their online image (e.g., allow users to understand what data a potential employer could mine about them online). We plan to work on assessing identity and attribute disclosure risks in current systems and on building defenses against such risks.
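One simple way to quantify identity disclosure risk is to look at how many records share the same combination of quasi-identifiers, in the spirit of k-anonymity. The sketch below uses hypothetical data and column names and is only an illustration of this risk measure, not of our planned methods.

```python
# Minimal sketch (hypothetical data and column names): estimating re-identification
# risk from the size of quasi-identifier equivalence classes, in the spirit of
# k-anonymity. Records whose quasi-identifier combination is shared by few others
# are at higher risk of identity disclosure.
import pandas as pd

# Hypothetical published dataset with quasi-identifiers an adversary could link on.
records = pd.DataFrame({
    "zip_code": ["38000", "38000", "38100", "38100", "38100"],
    "birth_year": [1990, 1990, 1985, 1985, 1972],
    "gender": ["F", "F", "M", "M", "F"],
    "diagnosis": ["flu", "asthma", "flu", "diabetes", "flu"],  # sensitive attribute
})
quasi_identifiers = ["zip_code", "birth_year", "gender"]

# Size of each equivalence class (records indistinguishable on the quasi-identifiers).
class_size = records.groupby(quasi_identifiers)["diagnosis"].transform("size")
records["reid_risk"] = 1.0 / class_size  # worst-case re-identification probability

print(records[["zip_code", "birth_year", "gender", "reid_risk"]])
print("dataset-level k (minimum class size):", int(class_size.min()))
```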
Privacy-preserving data publishing
We aim to develop a query-based approach for Linked Open Data anonymization, in the form of delete and update operations on RDF graphs, based on privacy and utility policies specified as SPARQL queries. We will focus on the problem of building safe anonymizations that guarantee that linking the anonymized dataset with any external RDF dataset will not cause privacy breaches. A particular difficulty will be handling sameAs links, which can be explicit or inferred from additional knowledge.
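To make the idea of query-based anonymization concrete, the sketch below applies a delete operation, expressed as a SPARQL UPDATE, to a small RDF graph with rdflib. The vocabulary, URIs, and policy are purely illustrative assumptions; the point is only the shape of the approach (a policy query determines which triples to delete or update), not our actual anonymization algorithm.

```python
# Minimal sketch (hypothetical vocabulary and policy): a query-based anonymization
# step expressed as a SPARQL UPDATE over an RDF graph, using rdflib.
from rdflib import Graph

data = """
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix ex:   <http://example.org/> .

ex:alice foaf:name "Alice" ; foaf:based_near ex:grenoble ; ex:diagnosedWith ex:asthma .
ex:bob   foaf:name "Bob"   ; foaf:based_near ex:paris    ; ex:diagnosedWith ex:flu .
"""
g = Graph()
g.parse(data=data, format="turtle")

# Illustrative privacy policy: the association between a person's name and a
# diagnosis must not be recoverable. One safe delete operation removes the names
# of all persons that have a diagnosis.
anonymization_op = """
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
DELETE { ?p foaf:name ?n }
WHERE  { ?p foaf:name ?n .
         ?p <http://example.org/diagnosedWith> ?d . }
"""
g.update(anonymization_op)

for s, p, o in g:
    print(s, p, o)
```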
Transparency and fairness
We aim to build tools and methods that increase the transparency of current user-centric Internet systems, help users understand why they received a particular service (e.g., why am I being targeted with this particular ad? why am I receiving this particular job offer?), and help domain owners identify various cognitive biases in their data-driven decision processes.
Trust
We aim to provide tools and methods to assess the trustworthiness of online identities and content (e.g., check whether a particular restaurant review is fake or real, or whether an individual is interacting with a legitimate user or an attacker). Assessing the trustworthiness of online identities is difficult because users operate behind weak identities – identities that can be created by a user without providing any proof of the user's real (offline) identity. The lack of trusted references for weak identities makes it hard to hold their owners accountable and hence to protect the system from attackers.
To this end, we leverage a variety of interdisciplinary methods drawn from different sub-disciplines of computer, information, and social science: machine learning, entity matching, information retrieval, image processing, natural language processing, network science, surveys, and user studies. The work is expected to have impact in three communities: security and privacy, Internet measurement, and data mining.
Large-scale data analytics
We aim to develop scalable solutions (models, algorithms) to extract actionable insights from data.
Knowledge graph enrichment and interlinking
From its early days, the Semantic Web has promoted a graph-based representation of data and knowledge by pushing the RDF standard. Knowledge graphs are modern knowledge representation formalisms in which entities, the nodes of the graph, are connected by relations, the edges of the graph. In addition, entity types and relations are organized in an ontology that defines their interrelations and restrictions. With the advent of Linked Open Data, knowledge graphs can be interlinked using several types of links (sameAs, seeAlso, relatedTo, etc.). In continuation of the SIDES 3.0 project, we will investigate supervised and unsupervised machine learning techniques for the automatic completion of missing links between questions and specialities or learning objectives within the OntoSIDES knowledge graph. We will also exploit information retrieval techniques to interlink OntoSIDES with standard medical ontologies (in particular SNOMED and MeSH). In the setting of ELKER, we will extend our previous work on rule-based methods for data interlinking into a rule reasoner that, from link keys and other kinds of knowledge, together with data, will generate new links.
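As a rough illustration of what a link key expresses, the sketch below generates owl:sameAs links between two toy datasets when entities agree on a set of property pairs. The datasets, properties, and matching rule are assumptions made up for the example; it is a simplification of rule-based interlinking, not the ELKER reasoner itself.

```python
# Minimal sketch (hypothetical datasets and properties): generating owl:sameAs links
# from a link key, i.e., a set of property pairs on which two entities from two
# datasets must agree to be considered the same individual.
dataset_a = {
    "a:alice": {"foaf:name": "Alice Martin", "foaf:mbox": "alice@example.org"},
    "a:bob":   {"foaf:name": "Bob Durand",   "foaf:mbox": "bob@example.org"},
}
dataset_b = {
    "b:p1": {"vcard:fn": "Alice Martin", "vcard:email": "alice@example.org"},
    "b:p2": {"vcard:fn": "Carol Petit",  "vcard:email": "carol@example.org"},
}
# Link key: agree on (foaf:name, vcard:fn) and on (foaf:mbox, vcard:email).
link_key = [("foaf:name", "vcard:fn"), ("foaf:mbox", "vcard:email")]

same_as_links = []
for ea, props_a in dataset_a.items():
    for eb, props_b in dataset_b.items():
        if all(props_a.get(pa) is not None and props_a.get(pa) == props_b.get(pb)
               for pa, pb in link_key):
            same_as_links.append((ea, "owl:sameAs", eb))

print(same_as_links)  # [('a:alice', 'owl:sameAs', 'b:p1')]
```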
Virtual Marketplaces
The rapid development of professional social networks and online labor markets is affecting the future of jobs and workers. Professional social networks such as LinkedIn are revolutionizing hiring practices. An increasing number of individuals rely on such networks to find jobs, and it is becoming common practice for head hunters and companies to examine one's profile on LinkedIn before contacting or hiring someone. Online job marketplaces are gaining popularity as a medium for hiring people to perform specific tasks. These marketplaces include freelancing platforms such as Qapa and MisterTemp' in France, and TaskRabbit and Fiverr in the USA.
We aim to bring together the various discipline-specific research activities that study online labor markets and to advocate for a unified framework that is interdisciplinary in nature. Such a framework could have a transformative effect on the nexus of humans, technology, and the future of work. We will revisit seminal and prominent works on online labor markets from different research communities (social science, machine learning, data management, psychology and organizational research, and economics), describe their approaches, and characterize their impact on science, society, and industry. We will define and implement the requirements of a unified framework that has the potential to combine the best of all these worlds, and address the modeling, data management, and algorithmic challenges of conceptualizing such a framework.
Social media and e-learning analytics
We aim to develop approaches that extract valuable information from user-generated content. We are particularly interested in extracting information from health-related social media and e-learning platforms.
In the context of the SIDES 3.0 e-learning platform, starting from the data describing medical students' training activities, we will perform on-demand information extraction at different granularities, driven by end-user analysts. A first use case will be to define explainable criteria, computable from the data, for a fine-grained analysis of the quality, difficulty, and discriminatory power of questions inside a pool of questions, or of distractors inside a question. A second use case will be to define and compute student trajectories over time (or for a group of students) related to items from the referential or to medical specialities.
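As an example of the kind of explainable, data-computable criteria involved, the sketch below computes two classical item-analysis measures over a hypothetical student response matrix: difficulty (proportion of correct answers) and discriminatory power (point-biserial correlation between an item and the rest of the test). These are standard psychometric measures used here for illustration, not the exact criteria that will be defined within SIDES 3.0.

```python
# Minimal sketch (hypothetical response matrix): classical item analysis for a pool
# of questions. Difficulty = proportion of correct answers; discrimination =
# point-biserial correlation between an item score and the total score on the
# remaining items.
import numpy as np

# Rows = students, columns = questions; 1 = correct, 0 = incorrect.
responses = np.array([
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 0],
])

difficulty = responses.mean(axis=0)  # higher value = easier question

discrimination = []
for j in range(responses.shape[1]):
    rest_score = responses.sum(axis=1) - responses[:, j]  # total score without item j
    r = np.corrcoef(responses[:, j], rest_score)[0, 1]    # point-biserial correlation
    discrimination.append(r)

for j, (d, r) in enumerate(zip(difficulty, discrimination)):
    print(f"question {j}: difficulty={d:.2f}, discrimination={r:.2f}")
```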
In the context of the IDEX CDP LIFE project, we collaborate with medical experts at the Grenoble hospital to develop a system that monitors and classifies health-related tweets. We developed methods to uncover ailments over time from social media. We formulated health transition detection and prediction problems and proposed two models to solve them. Detection is addressed with TM–ATAM, a granularity-based model that conducts region-specific analysis to identify time periods characterized by homogeneous disease discourse, per region. Prediction is addressed with T–ATAM, which treats time natively as a random variable whose values are drawn from a multinomial distribution. The fine-grained nature of T–ATAM results in significant improvements in modeling and predicting transitions of health-related tweets. We believe our approach is applicable to other domains with time-sensitive health topics. In particular, we will analyze tweets related to sleep apnea and liver diseases, two health topics that are of interest to our medical partners in Grenoble.
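To give a feel for the transition-detection task, the sketch below flags a discourse transition when the ailment distribution of one period diverges strongly from that of the previous period. The data, ailment categories, and threshold are made up, and comparing consecutive distributions with a divergence measure is a deliberate simplification for illustration, not an implementation of TM–ATAM or T–ATAM.

```python
# Minimal sketch (synthetic counts): detecting a health-discourse transition between
# consecutive time periods by comparing per-period ailment distributions with the
# Jensen-Shannon distance. Simplified illustration only.
import numpy as np
from scipy.spatial.distance import jensenshannon

# Ailment-mention counts per period for one region (columns: flu, allergy, insomnia).
period_counts = np.array([
    [120, 10,  5],
    [115, 12,  6],
    [ 40, 90,  8],   # discourse shifts from flu to allergy here
    [ 35, 95, 10],
])
distributions = period_counts / period_counts.sum(axis=1, keepdims=True)

THRESHOLD = 0.2  # tunable divergence threshold
for t in range(1, len(distributions)):
    d = jensenshannon(distributions[t - 1], distributions[t])
    if d > THRESHOLD:
        print(f"transition between period {t - 1} and period {t} (JS distance {d:.2f})")
```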
Information Exploration
Exploration is the process through which users find useful data and information. We present three topics related to our research plan in this domain.
Ontology-based data integration and exploration
In the setting of the SIDES 3.0 and CQFD projects, we aim to track students' history over the years, which requires integrating heterogeneous data from different sources. More generally, we aim to develop scalable methods for answering ontology-mediated complex temporal queries, which is a requirement in many applications.
Interactive exploration
We aim to develop effective data and information exploration methods and apply them in different domains. With the increase in health-care data in various sectors (e.g., prognoses, treatments, hospitalizations, and compliance), medical experts require effective analysis methods to understand the evolution of their patients' health. Medical cohort analysis exhibits the collective behavior of patients, providing insights on the evolution of their health conditions and their reaction to treatments. Cohort analysis serves various goals such as augmenting treatment effectiveness, patient satisfaction, and health-care revenue. In medical cohort analysis, experts seek answers to three recurring questions: "what will happen next?", "what has happened?", and "what happened to similar cohorts?" The first axis relates to predicting the future status of patients in a cohort based on their past. The second axis relates to finding and conveying a representation of patient data in a human-understandable way. The third axis relates to exploring the data to suggest similar cohorts to experts and help them find differences and contextualize decisions. We will develop interactive exploration methods that blend all three axes to obtain answers to questions such as "which sequence of treatments is the most relevant to an outcome, e.g., death?", "what changes in Body Mass Index (BMI) lead to death?", and "which treatment for some liver disease keeps patients' fatigue level at its lowest?" The novelty will be in enabling interactivity with the data to address prediction questions.
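As a small building block for questions such as "which sequence of treatments is the most relevant to an outcome?", the sketch below groups patients into cohorts by their chronological treatment sequence and compares an outcome rate across cohorts. The patient events, column names, and outcome are hypothetical; the interactive, exploratory layer described above is not represented here.

```python
# Minimal sketch (hypothetical patient events): cohorts defined by treatment sequence,
# compared on a binary outcome.
import pandas as pd

events = pd.DataFrame({
    "patient":   ["p1", "p1", "p2", "p2", "p3", "p3", "p4"],
    "date":      pd.to_datetime(["2020-01-05", "2020-03-10", "2020-01-20", "2020-02-11",
                                 "2020-01-02", "2020-04-15", "2020-02-01"]),
    "treatment": ["A", "B", "A", "B", "B", "A", "A"],
})
outcomes = pd.DataFrame({
    "patient": ["p1", "p2", "p3", "p4"],
    "deceased": [1, 0, 1, 0],
})

# Treatment sequence per patient, in chronological order.
sequences = (events.sort_values("date")
                   .groupby("patient")["treatment"]
                   .agg("->".join)
                   .reset_index(name="sequence"))

cohorts = outcomes.merge(sequences, on="patient")
# Outcome rate per treatment-sequence cohort.
print(cohorts.groupby("sequence")["deceased"].agg(["count", "mean"]))
```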
Data visualization
We develop visual enablers for data exploration to help analysts, be they novice analysts or domain experts, acquire an understanding of their data and extract actionable insights. Our visualization is data-driven and does not require analysts to know the underlying distributions in their data. While automated systems can identify and suggest potentially interesting data and information to explore, they can only do so for well-specified needs (e.g., expressed through SQL queries or constrained mining). Our visual enablers help analysts filter and refine their exploration as they discover what lies in the data. This exploration leads them to easily acquire statistics about their data and to find similar and dissimilar data. While most visual analytics systems are data-dependent, our enablers rely on a data model to integrate multiple components, from data ingestion to testing the impact of insights. Our visual enablers cater to analysts with varying levels of expertise. Novice analysts are generally interested in completing daily tasks such as finding a movie. For that, they need to find people like them and alternate between a user-centric view and a group-centric view of the data. They also need to explore individual and collective interests to reach a decision. Domain experts, on the other hand, tend to look for ways to validate assumptions on their data. Mixed-initiative systems will bring users into the loop and help them refine their needs as they interact with the data.