Complexity Bounds for Relational Algebra over Text Extractors
- 12:00 29th January 2019 ( Hilary Term 2019 )051
Information Extraction (IE) is the task of extracting structured information from textual data. We explore a programming paradigm that is supported by several IE systems where relations extracted by atomic extractors undergo a relational manipulation. In our efforts toward achieving a better understanding of IE systems, we study the computational complexity of queries in the framework of document spanners (spanners, for short) that was introduced by Fagin et al. A spanner is a function that extracts from a document (string) a relation over text intervals, called spans, using either atomic extractors or a relational algebra query on top of these extractors. Evaluating a spanner on a document is a computational problem that involves three components: the atomic extractors, the relational structure of the query, and the document. We investigate the complexity of this problem from various angles, each keeping some components fixed and regarding the rest as input.