Scalable Query Processing in Probabilistic Databases

April 2008, on going

SPROUT logo

Today, uncertainty is commonplace in data management scenarios dealing with data integration, sensor readings, information extraction from unstructured sources, and whenever information is manually entered and therefore prone to inaccuracy or partiality. Key challenges in probabilistic data management are to design probabilistic database formalisms that can compactly represent large sets of possible interpretations of uncertain data together with their probability distributions, and to efficiently evaluate queries on very large probabilistic data. Such queries could ask for confidences in data patterns possibly in the presence of additional evidence. The problem of query evaluation in probabilistic databases is still in its infancy. Little is known about which queries can be evaluated in polynomial time, and the few existing evaluation methods employ expensive main-memory algorithms.

The aim of this project is to develop techniques for scalable query processing in probabilistic databases and use them to build a robust query engine called SPROUT ( S calable PRO cessing on U ncertain T ables). We are currently exploring three main research directions.

We are investigating open problems in efficient query evaluation. In particular, we aim at discovering classes of tractable (i.e., computable in polynomial time wrt data complexity) queries on probabilistic databases. The query language under investigation is SQL (and its formal core, relational algebra) extended with uncertainty-aware query constructs to create probabilistic data under various probabilistic data models (such as tuple-independent databases, block-independent disjoint databases, or U-relations of MayBMS).
For the case of intractable queries, we investigate approximate query evaluation. In contrast to exact evaluation, which computes query answers together with their exact confidences, approximate evaluation computes the query answers with approximate confidences. We are working on new techniques for approximate query evaluation that are aware of the query and the input probabilistic database model (tuple-independent, block-independent disjoint, etc).
Our open-source query engine for probabilistic data management systems uses the insights gained from the first two directions. This engine is based on efficient secondary-storage exact and approximate evaluation algorithms for arbitrary queries.

News

[Oct 2013] Congratulations to Robert Fink, who successfully defended his PhD and joined Palantir!

[Sept 2013] Lo Po Tsui was awarded the Hoare prize for best overall performance in his MSc year at Oxford and distinction for his thesis on approximate inference for queries with aggregates on probabilistic databases.

[Sept 2012] Quentin Spencer-Harper was awarded the Gloucester Research Project Prize for best 3rd year CS project 2012. His project investigates the evaluation problem of queries with aggregates on probabilistic databases.

[Feb 2012] Robert Fink will do an internship over the summer of 2012 in the research group of Alon Halevy at Google Inc (Mountain View).

[Sept 2011] Larisa Han won the Hoare prize for the best MSc thesis in Computer Science at Oxford for her thesis on aggregates in probabilistic databases.

[July 1, 2011] A news item about our work on probabilistic databases and the new book has been featured on i-programmer . Thank you!

[June 2, 2011] The first book on the foundations of Probabilistic Databases by Dan Suciu, Dan Olteanu, Christopher Re, and Christoph Koch has been published by Morgan & Claypool!

[June 1, 2011] An article on probabilistic databases authored by Robert and Dan appeared in the 10th anniversary edition of The Digit Magazine, the largest selling technology magazine in India, with a certified readership of 200,000.

[June 1, 2011] An article featuring SPROUT has appeared in the second edition of Inspired Research, the industry letter of the Department of Computer Science at Oxford.

[Oct 2010] Swaroop Rath's MSc thesis on the evaluation of full relational algebra queries was awarded distinction. His overall performance placed him second in the MSc in CS course.

[Oct 2009] Rasmus Wissmann won the Hoare prize for the best overall performance in our MSc in Computer Science at Oxford. His thesis on "Tractable Queries with Inequalities on Probabilistic Databases" was also awarded distinction.

[July 2009] Our SIGMOD'09 paper on tractable conjunctive queries with inequalities passed the SIGMOD'09 repeatability and workability evaluation (RWE). Out of 63 accepted papers, 19 participated in RWE, and 10 papers passed RWE.

[May 2009] Jiewen Huang won a Travel Grant from ACM SIGMOD to attend SIGMOD 2009! Read more here about the grant.

[Sept 2008] Jiewen Huang won the Hoare prize for the best MSc thesis in Computer Science at Oxford. His thesis on "Scalable Query Evaluation for Tuple-Independent Probabilistic Databases" investigates aspects of this project and the results are reported in the SUM'08 and ICDE'09 papers.
Jiewen also won the Hoare prize for the best overall performance in his MSc year at Oxford.

[August 2008] SPROUT prototype released as part of the MayBMS database management system available at maybms.sourceforge.net

System prototype

SPROUT is implemented as an extension of the PostgreSQL backend with secondary-storage and main-memory algorithms for exact or approximate confidence computation. The latest online version of SPROUT is available here (this is the version reported in the ICDE 2010 paper; newer versions are available on request). Major releases of SPROUT are also included in the MayBMS database management system available at maybms.sourceforge.net. The latest MayBMS version (that can be checked out from the sourceforge svn) uses the SPROUT version reported in the ICDE 2009 paper.

Talks

Beyond Query Evaluation in Probabilistic Databases (upcoming talk)
Invited keynote, Scalable Uncertainty Management (SUM), Sept 2014, Oxford.

Probabilistic Databases are Dead, Long Live Probabilistic Databases!
Invited keynote at Big Uncertain Data Workshop at PODS, Snowbird, June 21, 2014.
Knowledge World Research Seminar, Google, February 21, 2014, Mountain View (California).

Aggregation in Probabilistic Databases [PDF]
Given by Robert at various venues in 2011-2012.

On the Optimal Approximation of Queries using Tractable Propositional Theories. [PDF]
Given by Robert at various venues in 2011-2012.

The Case of Relational Algebra Queries in Probabilistic Databases
Given by Dan at IC Colloquium, EPFL, January 2011, Lausanne.

A Look at Probabilistic Databases through SPROUT glasses
Given by Dan at the Statistics Departmental Seminar, November 2010, Oxford.

SPROUT: Scalable Query Processing in Probabilistic Databases [Feb 2010 overview talk PDF, A1 poster.pdf]
Given by Dan at various venues, 2008-2009.

A Toolbox of Query Evaluation Techniques for Probabilistic Databases. [PDF]
Invited talk at the Workshop on Logic in Databases (LID), October 2009.
Invited talk at the Workshop on Management and mining Of UNcertain Data (MOUND), In conjunction with ICDE, March 2010.

Incomplete and Probabilistic Data Management. 3-hour tutorial given at FOX training week, October 2009.

Theses

Lo Po Tsui: Approximate Evaluation for Aggregate Queries in Probabilistic Databases [PDF]

MSc in CS, Oxford 2013.

Asparuh Filipov: Minimal Map-Reduce Programs for Query Evaluation in Probabilistic Databases
MSc in CS, Oxford 2013.

Andrei Marin: Incremental View Maintenance in Probabilistic Databases
MSc in CS, Oxford 2012.

Quentin Spencer-Harper: Evaluation of Aggregate Queries on pvc-Tables
BSc in CS, Oxford 2012.

Larisa Han: Aggregates in Probabilistic Databases
MSc in CS, Oxford 2011.

Swaroop Rath: Efficient Evaluation of Full Relational Algebra on Probabilistic Databases
MSc in CS, Oxford 2010.

Jiewen Huang: Design and Implementation of the SPROUT Query Engine for Probabilistic Databases [PDF]
MSc by Research, Oxford 2009.

Rasmus Wissmann: Tractable Queries with Inequalities on Probabilistic Databases [PDF]
MSc in CS, Oxford 2009.

Smitha Mysore-Shankar: Learning Probabilistic Databases
MSc in CS, Oxford 2009.
Data collection: URLs to web-pages used to extract data and populate probabilistic databases

Jiewen Huang: Efficient Query Evaluation for Tuple-Independent Probabilistic Databases
MSc in CS, Oxford 2008.

Publications

Towards Deterministic Decomposable Circuits for Safe Queries. [PDF, code]
Mikaël Monet and Dan Olteanu.
Technical report, March 2018.

Query Processing on Probabilistic Data. [PDF]
Nilesh Dalvi and Dan Olteanu.
In Encyclopaedia of Database Systems, 2nd edition, 2017, Springer.

Uncertain Data Models. [PDF]
Christoph Koch and Dan Olteanu.
In Encyclopedia of Database Systems, vol. 2, 2017, Springer.

Dichotomies for Queries with Negation in Probabilistic Databases. [PDF]
Robert Fink and Dan Olteanu.
In ACM Transactions on Database Systems (TODS), 41(1): 4, 2016.

A dichotomy for non-repeating queries with negation in probabilistic databases. [PDF, poster]
Robert Fink and Dan Olteanu.
In PODS 2014.

Anytime Approximation in Probabilistic Databases. [preliminary version]
Robert Fink, Jiewen Huang, and Dan Olteanu.
In Very Large Data Bases Journal (VLDBJ), 2013.

Aggregates in Probabilistic Databases via Knowledge Compilation. [PDF, slides]
Robert Fink, Larisa Han, and Dan Olteanu.
In Very Large Data Bases (PVLDB), 5(5):490-501, Istanbul, 2012.

Ranking in Probabilistic Databases: Complexity and Efficient Algorithms. [PDF]
Dan Olteanu and Hongkai Wen.
In IEEE Int Conf on Data Engineering (ICDE), Washington DC, 2012.

Probablistic Databases
Synthesis Lectures on Data Management, Morgan & Claypool, 2011, 180 pages.
Dan Suciu, Dan Olteanu, Christopher Re, Christoph Koch.
DOI: 10.2200/S00362ED1V01Y201105DTM016.

The Next Big Challenge in Data Management: Probabilistic Databases.
Robert Fink and Dan Olteanu.
Invited article in The Digit Magazine, June 2011 (10th Anniversary Special), in print.
More about the Digit Magazine in Wikipedia.

Managing Probabilistic Data with SPROUT.
Dan Olteanu.
Article in the 2nd edition of Inspired Research published by the Department of Computer Science at the University of Oxford. June 2011.

SPROUT^2: A Squared Query Engine for Uncertain Web Data (Demonstration). [pdf, poster, web page ]
Robert Fink, Andrew Hogue, Dan Olteanu, and Swaroop Rath.
In ACM SIGMOD, Athens, June 2011.

Providing Support for Full Relational Algebra in Probabilistic Databases. [pdf]
Robert Fink, Dan Olteanu, and Swaroop Rath.
In IEEE Int Conf on Data Engineering (ICDE), Hannover, 2011.

On the Optimal Approximation of Queries using Tractable Propositional Theories. [pdf, slides]
Robert Fink and Dan Olteanu.
In Int Conf on Database Theory (ICDT), Uppsala, 2011.

Bridging the Gap Between Intensional and Extensional Query Evaluation in Probabilistic Databases.[pdf]
Abhay Jha, Dan Olteanu, and Dan Suciu.
In Int Conf on Extending Data Base Technology (EDBT), Lausanne, March 2010.

We extend the notion of tractable queries on (arbitrarily correlated) probabilistic databases to tractable data-query instances, where general hard queries can become tractable on restricted data instances. The query evaluation is separated into two steps: (1) the (possibly large) tractable data-query instance is evaluated first by efficient database techniques, and then (2) the (usually small) intractable residue is processed using inference techniques based on treewidth.

Approximate Confidence Computation in Probabilistic Databases. [pdf]
Dan Olteanu, Jiewen Huang, and Christoph Koch.
In IEEE Int Conf on Data Engineering (ICDE), Long Beach, March 2010.

We enhance our query engine SPROUT with a deterministic approximation algorithm with error guarantees for confidence computation in (arbitrarily correlated) probabilistic databases, which is based on incremental compilation of query lineage into so-called decomposition trees that support linear-time probability computation. The compilation is incremental and we show experimentally that it can achieve a given approximation within a few steps. In case of tractable (hierarchial or inequality) queries, it always finishes in polynomial time.

On Tractability of Inequality Queries over Probabilistic Databases and Counting Vertex Covers.
Dan Olteanu and Rasmus Wissmann.
technical report, September 2009.

Secondary-Storage Confidence Computation for Conjunctive Queries with Inequalities. [pdf]
Dan Olteanu, Jiewen Huang.
In ACM SIGMOD, Providence, 2009.

This paper is the first to discuss tractability of conjunctive queries with inequalities in probabilistic databases. It presents a class of hard queries, and gives a scalable algorithm for tractable inequality queries that is based on ordered binary decision diagrams and is implemented in our query engine SPROUT. Our implementation passed the SIGMOD'09 repeatability and workability evaluation (RWE). Out of 63 accepted papers, 19 participated in RWE, and 10 passed RWE.

Resources:
- Scripts for running the experiments.
- Code of the SPROUT query engine used in the experiments.

SPROUT: Lazy vs. Eager Query Plans for Tuple-Independent Probabilistic Databases. [pdf]
Dan Olteanu, Jiewen Huang, Christoph Koch.
In IEEE Int Conf on Data Engineering (ICDE), Shanghai, April 2009.

We show that the tractable hierarchical queries on probabilistic databases admit relational query plans with unrestricted join ordering, thereby overcoming the limitations of the safe plans of earlier work. In this framework, safe plans can be seen as one specific type of eager plans (where confidence computation is done as soon as possible). We extend relational plans with a new aggregation operator that is optimized using query signature and schema information (keys) to minimize the number of scans it needs to compute tuple confidences. This operator basically factorizes the query lineage into read-once functions (called 1OF formulas in the paper) in polynomial time. An extensive study on TPC-H queries is available here.

Resources:
- Case study: Tractable TPC-H queries on probabilistic databases.
- One-hour talk on the results of this paper.
- Current version of the SPROUT query engine.

Using OBDDs for Efficient Query Evaluation on Probabilistic Databases. [pdf]
Dan Olteanu, Jiewen Huang.
In 2nd Int. Conf. on Scalable Uncertainty Management (SUM), Napoli, Oct. 2008. Also in Proc. of Dagstuhl seminar #08421 on uncertainty management, October 2008.

This paper is the first to consider OBDDs for efficient evaluation of queries in probabilistic databases. It shows that for tractable hierarchical queries, the query lineage can be efficiently brought into one-occurrence form (that is, into an equivalent read-once function), where each variable occurs exactly once. For hard queries, existing results that bound the OBDD size in the pathwidth of the lineage apply immediately.

Conditioning Probabilistic Databases. [pdf]
Christoph Koch, Dan Olteanu.
In Very Large Data Bases (PVLDB), volume 1, 2008.
Also ACM CORR Report cs.DB/0803.2212.

This paper is the first to consider the problem of conditioning a probabilistic database outside of the context of graphical models. The core contribution is an exact confidence computation algorithm and its extension to achieve conditioning.

Resources:
- CODE: Exact and approximate probability computation using world-set trees (now part of MayBMS).
- Talk given at 34th Int. Conf. on Very Large Data Bases (VLDB), Auckland, Aug 2008.

Current team

Dan Olteanu

Collaborators: Andrew Hogue (Google), Abhay Jha (U. of Washington), Christoph Koch (Cornell University, now at EPFL), Dan Suciu ((U. of Washington)

Alumni: Asparuh Filipov (MSc student), Robert Fink (PhD student), Larisa Han (MSc student), Jiewen Huang (research student), Andrei Marin (MSc student), Smitha Mysore-Shankar (MSc student), Swaroop Rath (MSc student), Quentin Spencer-Harper (BSc student), Yunli Sung (MSc student), Lo Po Tsui (MSc student), Rasmus Wissmann (MSc student), Hongkai Wen (research student)

Acknowledgments

From May 2009 to Oct 2012, SPROUT is partially supported by the Future and Emerging Technologies (FET) programme within the Seventh Framework Programme for Research of the European Commission, under the FETOpen grant agreement FOX, number FP7-ICT-233599. During the academic year 2008-2009, Jiewen Huang was generously supported by a scholarship from Cornell University. Starting April 2012, SPROUT is partially supported by the EPSRC grant PROQAW. Dan Olteanu also acknowledges the support offered by an Astor Travel Fund grant from August 2013 to May 2014.