OXFORD UNIVERSITY COMPUTING LABORATORY

Scalable Query Processing in Probabilistic Databases

Today, uncertainty is commonplace in data management scenarios dealing with data integration, sensor readings, information extraction from unstructured sources, and whenever information is manually entered and therefore prone to inaccuracy or partiality. Key challenges in probabilistic data management are to design probabilistic database formalisms that can compactly represent large sets of possible interpretations of uncertain data together with their probability distributions, and to efficiently evaluate queries on very large probabilistic data. Such queries could ask for confidences in data patterns possibly in the presence of additional evidence. The problem of query evaluation in probabilistic databases is still in its infancy. Little is known about which queries can be evaluated in polynomial time, and the few existing evaluation methods employ expensive main-memory algorithms.

The aim of this project is to develop techniques for scalable query processing in probabilistic databases and use them to build a robust query engine called SPROUT ( S calable PRO cessing on U ncertain T ables). We are currently exploring three main research directions.

  • We are investigating open problems in efficient query evaluation. In particular, we aim at discovering classes of tractable (i.e., computable in polynomial time wrt data complexity) queries on probabilistic databases. The query language under investigation is SQL (and its formal core, relational algebra) extended with uncertainty-aware query constructs to create probabilistic data under various probabilistic data models (such as tuple-independent databases, block-independent disjoint databases, or U-relations of MayBMS).
  • For the case of intractable queries, we investigate approximate query evaluation. In contrast to exact evaluation, which computes query answers together with their exact confidences, approximate evaluation computes the query answers with approximate confidences. We are working on new techniques for approximate query evaluation that are aware of the query and the input probabilistic database model (tuple-independent, block-independent disjoint, etc).
  • Our open-source query engine for probabilistic data management systems uses the insights gained from the first two directions. This engine is based on efficient secondary-storage exact and approximate evaluation algorithms for arbitrary queries.

News

  • [Oct 2009] Rasmus Wissmann won the Hoare prize for the best overall performance in our MSc in Computer Science at Oxford. His thesis on "Tractable Queries with Inequalities on Probabilistic Databases" was also awarded distinction.
  • [July 2009] Our SIGMOD'09 paper on tractable conjunctive queries with inequalities passed the SIGMOD'09 repeatability and workability evaluation (RWE). Out of 63 accepted papers, 19 participated in RWE, and 10 papers passed RWE.
  • [May 2009] Jiewen Huang won a Travel Grant from ACM SIGMOD to attend SIGMOD 2009! Read more here about the grant.
  • [Sept 2008] Jiewen Huang won the Hoare prize for the best MSc thesis in Computer Science at Oxford. His thesis on "Scalable Query Evaluation for Tuple-Independent Probabilistic Databases" investigates aspects of this project and the results are reported in the SUM'08 and ICDE'09 papers.
    Jiewen also won the Hoare prize for the best overall performance in his MSc year at Oxford.
  • [August 2008] SPROUT prototype released as part of the MayBMS database management system available at maybms.sourceforge.net

System prototype

SPROUT is implemented as an extension of the PostgreSQL backend with secondary-storage and main-memory algorithms for exact or approximate confidence computation. The latest version of SPROUT is available here. Major releases of SPROUT are also included in the MayBMS database management system available at maybms.sourceforge.net.

Talks

  • SPROUT: Scalable Query Processing in Probabilistic Databases [PDF]
    Given by Dan at various venues, 2008-2009.
  • A Toolbox of Query Evaluation Techniques for Probabilistic Databases. Workshop on Logic in Databases (LID), October 2009.
  • Incomplete and Probabilistic Data Management. 3-hour tutorial given at FOX training week, October 2009.

Theses

  • Jiewen Huang: Design and Implementation of the SPROUT Query Engine for Probabilistic Databases [PDF]
    MSc by Research, Oxford 2009.
  • Rasmus Wissmann: Tractable Queries with Inequalities on Probabilistic Databases [PDF]
    MSc in CS, Oxford 2009.
  • Smitha Mysore-Shankar: Learning Probabilistic Databases
    MSc in CS, Oxford 2009.
    Data collection: URLs to web-pages used to extract data and populate probabilistic databases
  • Jiewen Huang: Efficient Query Evaluation for Tuple-Independent Probabilistic Databases
    MSc in CS, Oxford 2008.
    Hoare prizes for best MSc in CS thesis and best overall MSc in CS performance at Oxford

Publications

  • Approximate Confidence Computation in Probabilistic Databases. [pdf]
    Dan Olteanu, Jiewen Huang, and Christoph Koch.
    To appear in Proc. of IEEE Int Conf on Data Engineering (ICDE), Long Beach, March 2010 (full paper).
  • Secondary-Storage Confidence Computation for Conjunctive Queries with Inequalities. [pdf]
    Dan Olteanu, Jiewen Huang.
    In Proc. of ACM Special Interest Group on Management of Data (SIGMOD), Providence, 2009.
    Resources:
    • Scripts for running the experiments.
    • Code of the SPROUT query engine used in the experiments.
    This paper passed the SIGMOD'09 repeatability and workability evaluation (RWE). 63 accepted papers, 19 participated in RWE, and 10 papers passed RWE.
  • SPROUT: Lazy vs. Eager Query Plans for Tuple-Independent Probabilistic Databases. [pdf]
    Dan Olteanu, Jiewen Huang, Christoph Koch.
    In Proc. of IEEE Int Conf on Data Engineering (ICDE), Shanghai, April 2009 (full paper).
    Resources:
  • Conditioning Probabilistic Databases. [pdf]
    Christoph Koch, Dan Olteanu.
    In Proc. of VLDB Endowment (PVLDB), volume 1, 2008.
    Also ACM CORR Report cs.DB/0803.2212.
    Resources:

Current team

  • Dan Olteanu (PI)
  • Yunli Sung

External collaborator: Christoph Koch (Cornell University)

Alumni: Jiewen Huang (research student), Smitha Mysore-Shankar (MSc student), Rasmus Wissmann (MSc student)

Acknowledgments

Starting May 2009, the SPROUT research project is partially supported by the Future and Emerging Technologies (FET) programme within the Seventh Framework Programme for Research of the European Commission, under the FETOpen grant agreement FOX, number FP7-ICT-233599. During the academic year 2008-2009, Jiewen Huang was generously supported by a scholarship from Cornell University.

info

duration

April 2008 to August 2012

people

activities

themes

Random Image
Random Image
Random Image