Scalable Query Processing in Probabilistic Databases
Today, uncertainty is commonplace in data management scenarios dealing with data integration, sensor readings, information extraction from unstructured sources, and whenever information is manually entered and therefore prone to inaccuracy or partiality. Key challenges in probabilistic data management are to design probabilistic database formalisms that can compactly represent large sets of possible interpretations of uncertain data together with their probability distributions, and to efficiently evaluate queries on very large probabilistic data. Such queries could ask for confidences in data patterns possibly in the presence of additional evidence. The problem of query evaluation in probabilistic databases is still in its infancy. Little is known about which queries can be evaluated in polynomial time, and the few existing evaluation methods employ expensive main-memory algorithms.
The aim of this project is to develop techniques for scalable query processing in probabilistic databases and use them to build a robust query engine called SPROUT ( S calable PRO cessing on U ncertain T ables). We are currently exploring three main research directions.
- We are investigating open problems in efficient query evaluation. In particular, we aim at discovering classes of tractable (i.e., computable in polynomial time wrt data complexity) queries on probabilistic databases. The query language under investigation is SQL (and its formal core, relational algebra) extended with uncertainty-aware query constructs to create probabilistic data under various probabilistic data models (such as tuple-independent databases, block-independent disjoint databases, or U-relations of MayBMS).
- For the case of intractable queries, we investigate approximate query evaluation. In contrast to exact evaluation, which computes query answers together with their exact confidences, approximate evaluation computes the query answers with approximate confidences. We are working on new techniques for approximate query evaluation that are aware of the query and the input probabilistic database model (tuple-independent, block-independent disjoint, etc).
- Our open-source query engine for probabilistic data management systems uses the insights gained from the first two directions. This engine is based on efficient secondary-storage exact and approximate evaluation algorithms for arbitrary queries.
News
- [Oct 2009] Rasmus Wissmann won the Hoare prize for the best overall performance in our MSc in Computer Science at Oxford. His thesis on "Tractable Queries with Inequalities on Probabilistic Databases" was also awarded distinction.
- [July 2009] Our SIGMOD'09 paper on tractable conjunctive queries with inequalities passed the SIGMOD'09 repeatability and workability evaluation (RWE). Out of 63 accepted papers, 19 participated in RWE, and 10 papers passed RWE.
- [May 2009] Jiewen Huang won a Travel Grant from ACM SIGMOD to attend SIGMOD 2009! Read more here about the grant.
- [Sept 2008] Jiewen Huang won the Hoare prize for the best MSc thesis in Computer Science at Oxford. His thesis on "Scalable
Query Evaluation for Tuple-Independent Probabilistic Databases" investigates aspects of this project and the results are reported
in the SUM'08 and ICDE'09 papers.
Jiewen also won the Hoare prize for the best overall performance in his MSc year at Oxford.
- [August 2008] SPROUT prototype released as part of the MayBMS database management system available at maybms.sourceforge.net
System prototype
SPROUT is implemented as an extension of the PostgreSQL backend with secondary-storage and main-memory algorithms for exact or approximate confidence computation. The latest version of SPROUT is available here. Major releases of SPROUT are also included in the MayBMS database management system available at maybms.sourceforge.net.
Talks
- SPROUT: Scalable Query Processing in Probabilistic Databases [PDF]
Given by Dan at various venues, 2008-2009.
- A Toolbox of Query Evaluation Techniques for Probabilistic Databases.
Invited talk at the Workshop on Logic in Databases (LID), October 2009.
Invited talk at the Workshop on Management and mining Of UNcertain Data (MOUND), In conjunction with ICDE, March 2010.
- Incomplete and Probabilistic Data Management. 3-hour tutorial given at FOX training week, October 2009.
Theses
- Jiewen Huang: Design and Implementation of the SPROUT Query Engine for Probabilistic Databases [PDF]
MSc by Research, Oxford 2009.
- Rasmus Wissmann: Tractable Queries with Inequalities on Probabilistic Databases [PDF]
MSc in CS, Oxford 2009.
- Smitha Mysore-Shankar: Learning Probabilistic Databases
MSc in CS, Oxford 2009.
Data collection: URLs to web-pages used to extract data and populate probabilistic databases
- Jiewen Huang: Efficient Query Evaluation for Tuple-Independent Probabilistic Databases
MSc in CS, Oxford 2008.
Hoare prizes for best MSc in CS thesis and best overall MSc in CS performance at Oxford
Publications
- Bridging the Gap Between Intensional and Extensional Query Evaluation in Probabilistic Databases.[pdf]
Abhay Jha, Dan Olteanu, and Dan Suciu.
To appear in Int Conf on Extending Data Base Technology (EDBT), Lausanne, March 2010.
We extend the notion of tractable queries on (arbitrarily correlated) probabilistic databases to tractable data-query instances, where general hard queries can become tractable on restricted data instances. The query evaluation is separated into two steps: (1) the (possibly large) tractable data-query instance is evaluated first by efficient database techniques, and then (2) the (usually small) intractable residue is processed using inference techniques based on treewidth.
- Approximate Confidence Computation in Probabilistic Databases. [pdf]
Dan Olteanu, Jiewen Huang, and Christoph Koch.
To appear in Proc. of IEEE Int Conf on Data Engineering (ICDE), Long Beach, March 2010 (full paper).
We enhance our query engine SPROUT with a deterministic approximation algorithm with error guarantees for confidence computation in (arbitrarily correlated) probabilistic databases, which is based on incremental compilation of query lineage into so-called decomposition trees that support linear-time probability computation. The compilation is incremental and we show experimentally that it can achieve a given approximation within a few steps. In case of tractable (hierarchial or inequality) queries, it always finishes in polynomial time.
- On Tractability of Inequality Queries over Probabilistic Databases and Counting Vertex Covers.
Dan Olteanu and Rasmus Wissmann.
technical report, September 2009.
- Secondary-Storage Confidence Computation for Conjunctive Queries with Inequalities. [pdf]
Dan Olteanu, Jiewen Huang.
In Proc. of ACM Special Interest Group on Management of Data (SIGMOD), Providence, 2009.
This paper is the first to discuss tractability of conjunctive queries with inequalities in probabilistic databases. It presents a class of hard queries, and gives a scalable algorithm for tractable inequality queries that is based on ordered binary decision diagrams and is implemented in our query engine SPROUT. Our implementation passed the SIGMOD'09 repeatability and workability evaluation (RWE). Out of 63 accepted papers, 19 participated in RWE, and 10 passed RWE.
Resources:
- SPROUT: Lazy vs. Eager Query Plans for Tuple-Independent Probabilistic Databases. [pdf]
Dan Olteanu, Jiewen Huang, Christoph Koch.
In Proc. of IEEE Int Conf on Data Engineering (ICDE), Shanghai, April 2009 (full paper).
We show that the tractable hierarchical queries on probabilistic databases admit relational query plans with unrestricted join ordering, thereby overcoming the limitations of the safe plans of earlier work. In this framework, safe plans can be seen as one specific type of eager plans (where confidence computation is done as soon as possible). We extend relational plans with a new aggregation operator that is optimized using query signature and schema information (keys) to minimize the number of scans it needs to compute tuple confidences. This operator basically factorizes the query lineage into read-once functions (called 1OF formulas in the paper) in polynomial time. An extensive study on TPC-H queries is available here.
Resources:- Case study: Tractable TPC-H queries on probabilistic databases.
- One-hour talk on the results of this paper.
- Current version of the SPROUT query engine.
- Using OBDDs for Efficient Query Evaluation on Probabilistic Databases. [pdf]
Dan Olteanu, Jiewen Huang.
In Proc. of 2nd Int. Conf. on Scalable Uncertainty Management (SUM), Napoli, Oct. 2008. Also in Proc. of Dagstuhl seminar #08421 on uncertainty management, October 2008.
This paper is the first to consider OBDDs for efficient evaluation of queries in probabilistic databases. It shows that for tractable hierarchical queries, the query lineage can be efficiently brought into one-occurrence form (that is, into an equivalent read-once function), where each variable occurs exactly once. For hard queries, existing results that bound the OBDD size in the pathwidth of the lineage apply immediately.
- Conditioning Probabilistic Databases. [pdf]
Christoph Koch, Dan Olteanu.
In Proc. of VLDB Endowment (PVLDB), volume 1, 2008.
Also ACM CORR Report cs.DB/0803.2212.
This paper is the first to consider the problem of conditioning a probabilistic database outside of the context of graphical models. The core contribution is an exact confidence computation algorithm and its extension to achieve conditioning.
Resources:- CODE: Exact and approximate probability computation using world-set trees (now part of MayBMS).
- Talk given at 34th Int. Conf. on Very Large Data Bases (VLDB), Auckland, Aug 2008.
Current team
- Dan Olteanu
- Robert Fink
Collaborators: Abhay Jha (U. of Washington), Christoph Koch (Cornell University), Dan Suciu ((U. of Washington)
Alumni: Jiewen Huang (research student), Smitha Mysore-Shankar (MSc student), Yunli Sung (MSc student), Rasmus Wissmann (MSc student)
Acknowledgments
Starting May 2009, the SPROUT research project is partially supported by the Future and Emerging Technologies (FET) programme within the Seventh Framework Programme for Research of the European Commission, under the FETOpen grant agreement FOX, number FP7-ICT-233599. During the academic year 2008-2009, Jiewen Huang was generously supported by a scholarship from Cornell University.
info
|
Duration |
April 2008 to August 2012 |
|
People |
|
|
Activities |
|
|
Themes |
