Full Record

New Search | Similar Records

Title Automated synthesis of data extraction and transformation programs
Publication Date
Date Accessioned
Degree PhD
Discipline/Department Computer Science
Degree Level doctoral
University/Publisher University of Texas – Austin
Abstract Due to the abundance of data in today’s data-rich world, end-users increasingly need to perform various data extraction and transformation tasks. While many of these tedious tasks can be performed in a programmatic way, most end-users lack the required programming expertise to automate them and end up spending their valuable time in manually performing various data- related tasks. The field of program synthesis aims to overcome this problem by automatically generating programs from informal specifications, such as input-output examples or natural language. This dissertation focuses on the design and implementation of new systems for automating important classes of data transformation and extraction tasks. It introduces solutions for automating data manipulation tasks on fully- structured data formats like relational tables, or on semi-structured formats such as XML and JSON documents. First, we describe a novel algorithm for synthesizing hierarchical data transformations from input-output examples. A key novelty of our approach is that it reduces the synthesis of tree transformations to the simpler problem of synthesizing transformations over the paths of the tree. We also describe a new and effective algorithm for learning path transformations that combines logical SMT-based reasoning with machine learning techniques based on decision trees. Next, we present a new methodology for learning programs that migrate tree-structured documents to relational table representations from input-output examples. Our approach achieves its goal by decomposing the synthesis task to two subproblems of (A) learning the column extraction logic, and (B) learning the row extraction logic. We propose a technique for learning column extraction programs using deterministic finite automata, and a new algorithm for predicate learning which combines integer linear programing and logic minimization. Finally, we address the problem of automating data extraction tasks from natural language. Specifically, we focus on data retrieval from relational databases and describe a novel approach for learning SQL queries from English descriptions. The method we describe is fully automatic and database-agnostic (i.e., does not require customization for each database). Our method combines semantic parsing techniques from the NLP community with novel programming languages ideas involving probabilistic type inhabitation and automated sketch repair.
Subjects/Keywords Program synthesis; Programming-by-examples; Programming-by-natural-language; Databases
Contributors Dillig, Isil (advisor); Pingali, Keshav (committee member); Mooney, Raymond (committee member); Solar-Lezama, Armando (committee member)
Language en
Country of Publication us
Record ID handle:2152/68138
Repository texas
Date Retrieved
Date Indexed 2020-10-15
Grantor The University of Texas at Austin
Issued Date 2017-12-11 00:00:00
Note [department] Computer Sciences;

Sample Images