Leveraging the structure of uncertain data

Uncertainty on Ordered Data

The second axis of my PhD research concerns uncertainty on data with an order on tuples or on numerical values. This work is not presented in this manuscript: the first subsection is available as a preprint [Amarilli, Ba, Deutch, and Senellart 2016]; the second is also available as a preprint [Amarilli, Amsterdamer, Milo, and Senellart 2016] and was sketched at the UnCrowd workshop [Amarilli, Amsterdamer, and Milo 2014b].

There are many data management contexts where we must take into account an order relation on data values (such as dates or numerical numbers), or on the tuples themselves (e.g., a list of log entries, or a ranked list of results). This section present my work on uncertainty for order representation, first in a setting with uncertain order on relational tuples, and then in a setting with partially ordered numerical values.

Representing Order Uncertainty on Relational Tuples

To manage ordered tuples in the relational model, an interesting phenomenon is that uncertainty on the order arises even when we apply the standard relational algebra operators to input relations that are certain in terms of order. For a practical example, consider a travel website where you can search for hotels and which ranks them by quality. You are a party of four, and you want either two twin rooms, or one room with four beds; but the website does not allow you to search for both possibilities at once. So you perform one search, and then the other, and you obtain two ordered lists of hotels, of which you want to compute the union. However, the order on the union list is uncertain: it depends on the website’s estimation of quality, which you do not control. It is not fully unspecified, however, because two hotels that occurred only in the first list, or only in the second list, would probably keep the same order in the union.

To address this phenomenon, our work [Amarilli, Ba, Deutch, and Senellart 2016] proposes operators for the relational algebra that apply to partially ordered relations, under the bag semantics, in the spirit of [Grumbach and Milo 1999]. This allows us to combine totally and partially ordered relations with the relational algebra, to integrate data and compute new ordered relations, representing all possible consistent ordering choices. This resembles rank aggregation techniques [Fagin, Lotem, and Naor 2001; Jacob, Kimelfeld, and Stoyanovich 2014; Dwork, Kumar, Naor, and Sivakumar 2001] to reconcile ordered lists of results, but these methods are quantitative, i.e., if something appears close to the top in most lists, then it should also do so in the result. We focus instead on ways to represent all consistent ordering choices. Of course, a straightforward approach for this would be to use existing relational uncertainty frameworks, such as c-tables, to represent the uncertainty on tuple positions, but this would not work well: if we know that a result must come before another one, the dependency on numerical ranks becomes very tedious to represent.

The semantics that we define for relational algebra thus allows us to combine ordered data in a principled way. However, order is maintained as implicit information, and cannot be queried by our operators. To address this, we propose a general order-dependent accumulation operator on relations with uncertain order, which we can use to compute what we want to find out about the order. Accumulation simply computes all possible concatenations of tuple values, for all possible orders, in a given monoid. We can use it for queries on the order, e.g., “is it certain that all hotels in this district are better than all hotels in that district?”

The main technical contribution of our work is to study the complexity of query evaluation in this scheme, more specifically, the complexity of computing possible and certain answers for instance possibility and certainty [Antova, Koch, and Olteanu 2007], and for aggregation. Surprisingly, it is already intractable, i.e., respectively NP-hard and coNP-hard, to determine whether a query result (i.e., a full totally ordered relation) is possible or certain, even for very simple queries, though certainty (unlike possibility) is tractable in the absence of accumulation, and more generally when accumulating in a cancellative monoid.

We therefore show how hardness can be mitigated by restricting the structure of the input ordered relations, using order-theoretic measures. We show tractability of our problems, for some subset of the operators, when the input relations have constant poset width [Brandstädt, Le, and Spinrad 1987]: we show how such bounds are preserved on query results, and show the tractability of possible and certain answers in this context, through a dynamic algorithm. We show a similar result when the input relations are almost totally unordered, introducing the new poset measure of ia-width to define this. We also extend our results under an additional duplicate elimination operator, to go back from bag semantics to set semantics.

Completing Missing Numerical Values

In different work, I have focused [Amarilli, Amsterdamer, Milo, and Senellart 2016] on top-k query evaluation on numerical values which are unknown, but for which we know partial order constraints and some exact values. This work is inspired by crowd data sourcing scenarios [Amsterdamer, Grossman, Milo, and Senellart 2013; Parameswaran et al. 2012], where we wish to extract information from a crowd of users by asking questions. It is often the case that we wish to extract many interdependent numerical values from the crowd. For instance, to consolidate the catalog of a Web store, we may wish to determine the compatibility of each item in the catalog with each category of a taxonomy of products, where we assume that compatibility scores are partially ordered following the taxonomy: the “shirt” category is more specific than the “clothing” category, so if an item fits in the “shirt” category, then it must also fit in the “clothing” category. The simplest approach would be to query the crowd about each of the categories, but in general we cannot afford to do that. Indeed, every crowd query that you make introduces latency (you need to wait for the answers to arrive) and monetary costs (you have to pay the workers). For this reason, we studied principled ways to interpolate the values that we do not have, using the known partial order structure on them, from those values that we have already obtained from the crowd. The goal is to find the top-k items with the highest expected value (here, the top-k categories in which our product can be filed) given the limited information that we have. While completing a total order of values is well understood and can be done with linear interpolation, it appears that the general question of interpolating on a partial order of unknown values had not been studied before. We propose a definition for this problem as that of computing the expected value of each unknown variable in the convex polytope induced by the partial order constraints, taking the uniform distribution as our prior. We show an algorithm to solve the interpolation problem in FP#P, and show that the task is #P-hard, based on results on partial order theory [Brightwell and Winkler 1991]. We extend this to show that it is already hard to simply decide which item has the highest expected value, even without computing the value. We also show tractable approaches: we can design a fully polynomial randomized approximation scheme for the interpolation problem, using a scheme to sample convex polytopes [Kannan, Lovász, and Simonovits 1997], and we show that interpolation (for our principled definition) is tractable for tree-shaped taxonomies, using a dynamic algorithm.

Le rapport de stage ou le pfe est un document d’analyse, de synthèse et d’évaluation de votre apprentissage, c’est pour cela rapport-gratuit.com propose le téléchargement des modèles complet de projet de fin d’étude, rapport de stage, mémoire, pfe, thèse, pour connaître la méthodologie à avoir et savoir comment construire les parties d’un projet de fin d’étude.

Table des matières

General Introduction
1 Provenance and Probability on Treelike Instances
2 Uncertainty on Ordered Data
3 Decidability for Open-World Query Answering
Structure of the Manuscript
I Provenance and Probability on Treelike Instances
1 Introduction
2 Preliminaries
2.1 Trees and Tree Automata
2.2 Boolean Functions, Formulae, and Circuits
2.3 Instances and Graphs
2.4 Tree Decompositions
2.5 Queries
2.6 Query Evaluation and Probabilities
3 Provenance for Treelike Instances
3.1 Provenance Circuits for Trees
3.2 Rewriting Queries on Treelike Instances to Tree Automata
3.3 Provenance Circuits for Treelike Instances
3.4 Monotone Provenance Circuits
3.5 OBDD and d-DNNF Representations
3.6 Formula Representations
4 Applications to Probabilistic Query Evaluation
4.1 Probability Evaluation on Treelike TID Instances
4.2 CC-Instances
4.3 Applications to Probabilistic Relational Frameworks
4.4 Applications to Probabilistic XML
4.5 Connection to Safe Queries
4.6 Application to Match Counting
5 Extending to General Semirings
5.1 Defining Semiring Provenance
5.2 Semiring Provenance Circuits for Trees
5.3 Semiring Provenance Circuits for Instances
5.4 Going Beyond UCQ6=
6 Lower Bounds
6.1 Dichotomy on Probability Evaluation
6.2 Dichotomy on Non-Probabilistic Evaluation
6.3 Dichotomy on Match Counting
6.4 Dichotomy for OBDD Representations
6.5 Meta-Dichotomy for OBDD Representations
II Finite Open-World Query Answering
7 Introduction
8 Preliminaries
8.1 Instances and Constraints
8.2 Implication and Finite Implication
8.3 Queries and QA
8.4 Details about the UID Chase
9 Main Result and Overall Approach
9.1 Finite Universal Superinstances
9.2 Proof Structure
10 Weak Soundness on Binary Signatures
10.1 Completing Balanced Instances
10.2 Adding Helper Elements
10.3 Putting it Together
11 Weak Soundness on Arbitrary Arity Signatures
11.1 Piecewise Realizations
11.2 Relation-Saturation
11.3 Thrifty Chase Steps
11.4 Relation-Thrifty Completions
12 Ensuring k-Universality
12.1 Aligned Superinstances
12.2 Fact-Saturation
12.3 Fact-Thrifty Steps
12.4 Essentiality
12.5 UID Chase Similarity Theorem
General Conclusion