Recent Advances in Text-to-SQL:
A Survey of What We Have and What We Expect
Naihao Deng | Yulong Chen | Yue Zhang
[Paper] | [GitHub] | [YouTube] / [BiliBili]
2023 | BIRD, a new cross-domain text-to-SQL benchmark in the era of LLMs. | [Paper]
Text-to-SQL has attracted attention from both the natural language processing and database communities because of its ability to convert the semantics in natural language into SQL queries and its practical application in building natural language interfaces to database systems.
The major challenges in text-to-SQL lie in encoding the meaning of natural utterances, decoding to SQL queries, and translating the semantics between these two forms. These challenges have been addressed to different extents by recent advances. However, there is still a lack of comprehensive surveys for this task. To this end, we review recent progress on text-to-SQL in terms of datasets, methods, and evaluation and provide this systematic survey, addressing the aforementioned challenges and discussing potential future directions. We hope this survey can serve as quick access to existing work and motivate future research.
Citation
If you would like to cite our work, please cite the following work:
Naihao Deng, Yulong Chen, and Yue Zhang. 2022. Recent Advances in Text-to-SQL: A Survey of What We Have and What We Expect. In Proceedings of the 29th International Conference on Computational Linguistics, Gyeongju, Republic of Korea. International Committee on Computational Linguistics.
@inproceedings{deng-etal-2022-recent-advances-in-text-to-sql,
title = "Recent Advances in Text-to-{SQL}: A Survey of What We Have and What We Expect",
author = "Deng, Naihao and
Chen, Yulong and
Zhang, Yue",
booktitle = "Proceedings of the 29th International Conference on Computational Linguistics",
month = oct,
year = "2022",
address = "Gyeongju, Republic of Korea",
publisher = "International Committee on Computational Linguistics",
url = "https://aclanthology.org/events/coling-2022/",
}
Survey Sketch
Concept Diagram
[Figure: Concept diagram] The framework for text-to-SQL systems. Given the database schema and user utterance, the system outputs a corresponding SQL query to query the database system for the result. Appendix B gives more text-to-SQL examples.
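As a concrete, hypothetical illustration of this framework (not code from the survey), the sketch below serializes a SQLite schema to text, lets a stub stand in for the text-to-SQL model, and executes the predicted query for the result:

```python
import sqlite3

def serialize_schema(conn):
    """Flatten the database schema into 'table: col1, col2 | ...' text."""
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'").fetchall()
    parts = []
    for (table,) in tables:
        cols = [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]
        parts.append(f"{table}: {', '.join(cols)}")
    return " | ".join(parts)

def text_to_sql(question, schema):
    # A real system would invoke a trained model here; this stub returns a
    # fixed query purely to illustrate the input/output contract.
    return "SELECT COUNT(*) FROM restaurant;"

def answer(conn, question):
    schema = serialize_schema(conn)       # encode the DB schema as text
    sql = text_to_sql(question, schema)   # model maps utterance -> SQL
    return conn.execute(sql).fetchall()   # execute SQL for the result

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE restaurant (id INTEGER, name TEXT, food_type TEXT)")
print(answer(conn, "How many restaurants are there?"))  # -> [(0,)]
```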
[Figure: Topology for text-to-SQL. Format adapted from (Liu et al., 2021a)]
Datasets
Other relevant datasets
- Chinese text-to-SQL:
  - CSpider (Min et al., 2019a)
  - TableQA (Sun et al., 2020)
  - DuSQL (Wang et al., 2020c)
  - ESQL (Chen et al., 2021a)
- Vietnamese text-to-SQL: ViText2SQL (Tuan Nguyen et al., 2020)
- Portuguese text-to-SQL: (José and Cozman, 2021)
- Multi-turn context-dependent text-to-SQL:
  - ATIS (Price, 1990) (Dahl et al., 1994)
  - SParC (Yu et al., 2019b)
  - CoSQL (Yu et al., 2019a)
- Unanswerable questions: TriageSQL (Zhang et al., 2020)
Resources
- This Github Repo holds data for many of the single-domain datasets, including Academic, Advising, ATIS, Geography, IMDB, Restaurants, Scholar, and Yelp, as well as the cross-domain datasets Spider and WikiSQL. It is the official repo for this paper, and the data-processing methods are described in the paper as well.
- The official Spider leaderboard shows the performance of models on the Spider dev set and the hidden test set.
- The official WikiSQL Github Repo holds the leaderboard for models on the WikiSQL dataset.
Methods
Data Augmentation
- Typical effects:
  - Handle complex or unseen questions (Zhong et al., 2020b), (Wang et al., 2021b)
  - Achieve state-of-the-art with less supervised data (Guo et al., 2018)
  - Attain robustness towards different types of questions (Radhakrishnan et al., 2020)
- Data generation methods:
- Quality control of the generated data:
- Diversify the generated data:
Encoding
Decoding
Learning Techniques
- Fully supervised
  - Active learning (Ni et al., 2020)
  - Interactive/Imitation learning (Yao et al., 2019) (Yao et al., 2020)
  - Meta-learning (Huang et al., 2018) (Wang et al., 2021a) (Chen et al., 2021a)
  - Multi-task learning (Chang et al., 2020) (Xuan et al., 2021) (Hui et al., 2021b) (Shi et al., 2021) (McCann et al., 2018) (Xie et al., 2022)
- Weakly supervised
  - Reinforcement learning (Zhong et al., 2017) (Liang et al., 2018)
  - Meta-learning and Bayesian optimization (Agarwal et al., 2019)
  - (Min et al., 2019b)
Miscellaneous
- DB linking:
- Model-wise:
- SQL generation:
For context-dependent text-to-SQL:
- Turn-level encoder and copy mechanism (Suhr et al., 2018) (Zhang et al., 2019) (Wang et al., 2020b)
- Constrained decoding (Wang et al., 2020b)
- Dynamic memory decay mechanism (Hui et al., 2021a)
- (Zheng et al., 2022)
Evaluation
Metrics
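Commonly reported metrics include exact(-set) match between the predicted and gold SQL and execution accuracy, which compares the results of executing the two queries. Below is a minimal sketch of execution accuracy over SQLite; it is a simplification (unordered result comparison, no value normalization) rather than any benchmark's official evaluator.

```python
import sqlite3
from collections import Counter

def execution_accuracy(db_path, pairs):
    """Fraction of (gold_sql, pred_sql) pairs whose execution results match.

    Simplified sketch: compares unordered result multisets on one SQLite
    database and skips the value normalization official evaluators perform.
    """
    conn = sqlite3.connect(db_path)
    correct = 0
    for gold_sql, pred_sql in pairs:
        try:
            gold = Counter(conn.execute(gold_sql).fetchall())
            pred = Counter(conn.execute(pred_sql).fetchall())
            correct += int(gold == pred)
        except sqlite3.Error:
            pass  # an unexecutable prediction counts as wrong
    conn.close()
    return correct / len(pairs) if pairs else 0.0
```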
Evaluation Setup
- Question split: split the question-SQL pairs randomly (Iyer et al., 2017).
- SQL query split: no SQL query is allowed to appear in more than one of the train, dev, and test sets (Finegan-Dollak et al., 2018); see the sketch after this list.
- Database split: databases in the test set do not appear at training time (Yu et al., 2018c).
- Others (Shaw et al., 2021) (Chang et al., 2020)
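Below is a minimal sketch of the SQL query split, assuming each example is a dict with hypothetical "question" and "sql" fields. Whole groups of examples sharing a SQL query are assigned to a single set, so paraphrases of the same query cannot leak across splits (Finegan-Dollak et al. (2018) group by anonymized SQL templates, a refinement omitted here).

```python
import random
from collections import defaultdict

def query_split(examples, ratios=(0.8, 0.1, 0.1), seed=0):
    """Split so that no SQL query appears in more than one of train/dev/test.

    Groups examples by their SQL string, then assigns whole groups to sets,
    so paraphrases of the same query never cross the split boundary.
    """
    groups = defaultdict(list)
    for ex in examples:
        groups[ex["sql"]].append(ex)
    keys = sorted(groups)
    random.Random(seed).shuffle(keys)
    n = len(keys)
    cut1, cut2 = int(ratios[0] * n), int((ratios[0] + ratios[1]) * n)
    train = [ex for k in keys[:cut1] for ex in groups[k]]
    dev = [ex for k in keys[cut1:cut2] for ex in groups[k]]
    test = [ex for k in keys[cut2:] for ex in groups[k]]
    return train, dev, test
```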
Discussion and Future Directions
- Cross-domain text-to-SQL: incorporate domain knowledge into models trained on existing datasets, and deploy such models efficiently across different domains.
- Real-world use cases:
  - Handle corrupted tables or scenarios where no table is provided.
  - Handle user inputs that differ from those in the existing datasets.
  - Help DB administrators manage DB schemas and update DB content.
  - Multi-lingual text-to-SQL.
  - DB interfaces for people with disabilities.
- Integration into a larger scope of research:
  - QA systems for DBs.
  - Dialogue systems with knowledge from DBs.
  - Exploring the inter-relation between SQL and other logical forms.
  - Generalized semantic parsing.
- Others:
  - Apply prompt learning to text-to-SQL: how to make the system robust.
  - Evaluation of the existing text-to-SQL systems.
Text-to-SQL examples
Example of the table in the database
The domain of the Restaurants dataset is restaurant information, where questions are typically about food type, restaurant location, etc.
Databases differ widely in how many tables they contain: the Restaurants database has 3 tables, while the ATIS database has 32 (Suhr et al., 2020).
Domain Knowledge
Question:
Will undergrads be okay to take 581 ?
SQL query:
SELECT DISTINCT T1.ADVISORY_REQUIREMENT , T1.ENFORCED_REQUIREMENT , T1.NAME
FROM COURSE AS T1
WHERE T1.DEPARTMENT = "EECS" AND T1.NUMBER = 581 ;
In the Advising dataset, the department “EECS” is domain knowledge: “581” in the utterance refers to the course with number 581 in the EECS department.
Dataset Convention
Question:
Give me some restaurants in alameda ?
SQL query:
SELECT T1.HOUSE_NUMBER , T2.NAME
FROM LOCATION AS T1 , RESTAURANT AS T2
WHERE T1.CITY_NAME = "alameda" AND T2.ID = T1.RESTAURANT_ID ;
In the Restaurants dataset, when the user queries “restaurants”, the corresponding SQL query returns the columns “HOUSE_NUMBER” and “NAME” by dataset convention.
Text-to-SQL Templates
An example of the template for a text-to-SQL pair used by Iyer et al. (2017) is as follows:
Question template:
Get all <ENT1>.<NAME> having <ENT2>.<COL1>.<NAME> as <ENT2>.<COL1>.<TYPE>
SQL query template:
SELECT <ENT1>.<DEF>
FROM JOIN_FROM(<ENT1>, <ENT2>)
WHERE JOIN_WHERE(<ENT1>, <ENT2>)
AND <ENT2>.<COL1> = <ENT2>.<COL1>.<TYPE> ;
Generated question:
Get all author having dataset as DATASET_TYPE
Generated SQL query:
SELECT author.authorId
FROM author , writes , paper , paperDataset , dataset
WHERE author.authorId = writes.authorId
AND writes.paperId = paper.paperId
AND paper.paperId = paperDataset.paperId
AND paperDataset.datasetId = dataset.datasetId
AND dataset.datasetName = DATASET_TYPE ;
Here, the slots in the templates are populated with table and column names from the database schema, and the corresponding tables are joined accordingly.
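Below is a toy, hypothetical sketch of this slot filling. The slot values are hand-picked to reproduce the example above, whereas a real generator enumerates entities and columns from the database schema and expands the join macros:

```python
# Toy sketch: fill the slots of Iyer et al. (2017)-style NL/SQL templates.
Q_TEMPLATE = ("Get all <ENT1>.<NAME> having <ENT2>.<COL1>.<NAME> "
              "as <ENT2>.<COL1>.<TYPE>")
SQL_TEMPLATE = ("SELECT <ENT1>.<DEF> FROM JOIN_FROM(<ENT1>, <ENT2>) "
                "WHERE JOIN_WHERE(<ENT1>, <ENT2>) "
                "AND <ENT2>.<COL1> = <ENT2>.<COL1>.<TYPE> ;")

# Longest slots first, so compound slots are filled before their prefixes.
FILLS = [
    ("<ENT2>.<COL1>.<TYPE>", "DATASET_TYPE"),  # typed value placeholder
    ("<ENT2>.<COL1>.<NAME>", "dataset"),       # column's natural-language name
    ("<ENT1>.<NAME>", "author"),               # entity's natural-language name
    ("<ENT1>.<DEF>", "author.authorId"),       # entity's default projection
    ("<ENT2>.<COL1>", "dataset.datasetName"),  # concrete column reference
    ("<ENT1>", "author"),
    ("<ENT2>", "dataset"),
]

def populate(template):
    for slot, value in FILLS:
        template = template.replace(slot, value)
    return template

print(populate(Q_TEMPLATE))
# -> Get all author having dataset as DATASET_TYPE
print(populate(SQL_TEMPLATE))
# JOIN_FROM/JOIN_WHERE remain as macros; the real generator expands them
# into the join path author -> writes -> paper -> paperDataset -> dataset.
```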
Iyer et al. (2017) also use PPDB (Ganitkevitch et al., 2013) to paraphrase the natural language questions. An example of a PPDB paraphrase pair is “thrown into jail” and “imprisoned”.
This Github Repo also holds NL-SQL templates for Academic, Advising, ATIS, Geography, IMDB, Restaurants, Scholar, and Yelp, where the corresponding entities in the NL-SQL pairs are masked.
Complexity of NL-SQL pairs
Yu et al. (2018c) define SQL hardness as the number of SQL components: a SQL query is harder when it contains more SQL keywords such as GROUP BY and nested subqueries. Here are some examples from the original paper:
Easy:
SELECT COUNT(*)
FROM cars_data
WHERE cylinders > 4 ;
Medium:
SELECT T2.name , COUNT(*)
FROM concert AS T1 JOIN stadium AS T2 ON T1.stadium_id = T2.stadium_id
GROUP BY T1.stadium_id ;
Hard:
SELECT T1.country_name
FROM countries AS T1 JOIN continents AS T2 ON T1.continent = T2.cont_id
JOIN car_makers AS T3 ON T1.country_id = T3.country
WHERE T2.continent = 'Europe'
GROUP BY T1.country_name
HAVING COUNT(*) >= 3 ;
Extra Hard:
SELECT AVG(life_expectancy)
FROM country
WHERE name NOT IN
  (SELECT T1.name
   FROM country AS T1 JOIN country_language AS T2 ON T1.code = T2.country_code
   WHERE T2.language = "English" AND T2.is_official = "T") ;
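As a rough, hypothetical illustration of counting SQL components (much coarser than the official Spider criteria), the sketch below buckets a query by keyword counts and subquery nesting; it happens to classify the four examples above consistently:

```python
import re

# Keywords whose count drives the (simplified) hardness estimate.
COMPONENTS = re.compile(
    r"\b(GROUP\s+BY|ORDER\s+BY|HAVING|JOIN|EXCEPT|UNION|INTERSECT|LIMIT)\b",
    re.IGNORECASE,
)

def sql_hardness(sql):
    """Bucket a SQL query by a crude component count (not the official rule)."""
    if re.search(r"\(\s*SELECT\b", sql, re.IGNORECASE):
        return "extra hard"          # any nested subquery
    n = len(COMPONENTS.findall(sql))
    if n == 0:
        return "easy"                # bare SELECT-FROM-WHERE
    return "medium" if n <= 2 else "hard"

print(sql_hardness("SELECT COUNT(*) FROM cars_data WHERE cylinders > 4 ;"))
# -> easy
```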
There is no established measure of how hard the natural language is. Intuitively, model performance can decrease when faced with longer questions from users. However, the information conveyed in longer sentences can be more complete, while shorter sentences can be ambiguous. Besides, domain-specific phrases can confuse the model in both short and long utterances (Suhr et al., 2020). Thus, researchers need to consider various perspectives to determine the complexity of a natural utterance.
Acknowledgements