Recent Advances in Text-to-SQL:
A Survey of What We Have and What We Expect
Naihao Deng | Yulong Chen | Yue Zhang
[Paper] | [GitHub] | [YouTube] / [BiliBili]
2023 | BIRD, a new cross-domain text-to-SQL benchmark in the era of LLMs. | [Paper]
Text-to-SQL has attracted attention from both the natural language processing and database communities because of its ability to convert the semantics in natural language into SQL queries and its practical application in building natural language interfaces to database systems.
The major challenges in text-to-SQL lie in encoding the meaning of natural utterances, decoding to SQL queries, and translating the semantics between these two forms. These challenges have been addressed to different extents by recent advances. However, there is still a lack of comprehensive surveys for this task. To this end, we review recent progress on text-to-SQL in terms of datasets, methods, and evaluation and provide this systematic survey, addressing the aforementioned challenges and discussing potential future directions. We hope this survey can serve as quick access to existing work and motivate future research.
Citation
If you would like to cite our work, please cite the following work:
Naihao Deng, Yulong Chen, and Yue Zhang. 2022. Recent Advances in Text-to-SQL: A Survey of What We Have and What We Expect. In Proceedings of the 29th International Conference on Computational Linguistics, Gyeongju, Republic of Korea. International Committee on Computational Linguistics.
@inproceedings{deng-etal-2022-recent-advances-in-text-to-sql,
title = "Recent Advances in Text-to-{SQL}: A Survey of What We Have and What We Expect",
author = "Deng, Naihao and
Chen, Yulong and
Zhang, Yue",
booktitle = "Proceedings of the 29th International Conference on Computational Linguistics",
month = oct,
year = "2022",
address = "Gyeongju, Republic of Korea",
publisher = "International Committee on Computational Linguistics",
url = "https://aclanthology.org/events/coling-2022/",
}
Survey Sketch
Concept Diagram
[Figure: Concept diagram] The framework for text-to-SQL systems. Given the database schema and user utterance, the system outputs a corresponding SQL query to query the database system for the result. Appendix B gives more text-to-SQL examples.
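As a concrete, hypothetical illustration of this framework (not code from the survey), the sketch below serializes a SQLite schema to text, lets a stub stand in for the text-to-SQL model, and executes the predicted query for the result:

```python
import sqlite3

def serialize_schema(conn):
    """Flatten the database schema into 'table: col1, col2 | ...' text."""
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'").fetchall()
    parts = []
    for (table,) in tables:
        cols = [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]
        parts.append(f"{table}: {', '.join(cols)}")
    return " | ".join(parts)

def text_to_sql(question, schema):
    # A real system would invoke a trained model here; this stub returns a
    # fixed query purely to illustrate the input/output contract.
    return "SELECT COUNT(*) FROM restaurant;"

def answer(conn, question):
    schema = serialize_schema(conn)       # encode the DB schema as text
    sql = text_to_sql(question, schema)   # model maps utterance -> SQL
    return conn.execute(sql).fetchall()   # execute SQL for the result

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE restaurant (id INTEGER, name TEXT, food_type TEXT)")
print(answer(conn, "How many restaurants are there?"))  # -> [(0,)]
```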
[Figure: Topology for text-to-SQL. Format adapted from (Liu et al., 2021a)]
Datasets
Other relevant datasets
- Chinese text-to-SQL:
  - CSpider (Min et al., 2019a)
  - TableQA (Sun et al., 2020)
  - DuSQL (Wang et al., 2020c)
  - ESQL (Chen et al., 2021a)
- Vietnamese text-to-SQL: ViText2SQL (Tuan Nguyen et al., 2020)
- Portuguese text-to-SQL: (José and Cozman, 2021)
- Multi-turn context-dependent text-to-SQL:
  - ATIS (Price, 1990) (Dahl et al., 1994)
  - SParC (Yu et al., 2019b)
  - CoSQL (Yu et al., 2019a)
- Unanswerable questions: TriageSQL (Zhang et al., 2020)
Resources
- This Github Repo holds data for many of the single-domain datasets, including Academic, Advising, ATIS, Geography, IMDB, Restaurants, Scholar, and Yelp, as well as the cross-domain datasets Spider and WikiSQL. It is the official repo for this paper, and the data-processing methods are described in the paper as well.
- The official Spider leaderboard shows the performance of models on the Spider dev set and the hidden test set.
- The official WikiSQL Github Repo holds the leaderboard for models on the WikiSQL dataset.
Methods
Data Augmentation
- Typical effects:
  - Handle complex or unseen questions (Zhong et al., 2020b), (Wang et al., 2021b)
  - Achieve state-of-the-art with less supervised data (Guo et al., 2018)
  - Attain robustness towards different types of questions (Radhakrishnan et al., 2020)
- Data generation methods:
- Quality control of the generated data:
- Diversify the generated data:
Encoding
Decoding
Learning Techniques
- Fully supervised
  - Active learning (Ni et al., 2020)
  - Interactive/Imitation learning (Yao et al., 2019) (Yao et al., 2020)
  - Meta-learning (Huang et al., 2018) (Wang et al., 2021a) (Chen et al., 2021a)
  - Multi-task learning (Chang et al., 2020) (Xuan et al., 2021) (Hui et al., 2021b) (Shi et al., 2021) (McCann et al., 2018) (Xie et al., 2022)
- Weakly supervised
  - Reinforcement learning (Zhong et al., 2017) (Liang et al., 2018)
  - Meta-learning and Bayesian optimization (Agarwal et al., 2019)
  - (Min et al., 2019b)
Miscellaneous
- DB linking:
- Model-wise:
- SQL generation:
For context-dependent text-to-SQL:
- Turn-level encoder and copy mechanism (Suhr et al., 2018) (Zhang et al., 2019) (Wang et al., 2020b)
- Constrained decoding (Wang et al., 2020b)
- Dynamic memory decay mechanism (Hui et al., 2021a)
- (Zheng et al., 2022)
Evaluation
Metrics
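Commonly reported metrics include exact(-set) match between the predicted and gold SQL and execution accuracy, which compares the results of executing the two queries. Below is a minimal sketch of execution accuracy over SQLite; it is a simplification (unordered result comparison, no value normalization) rather than any benchmark's official evaluator.

```python
import sqlite3
from collections import Counter

def execution_accuracy(db_path, pairs):
    """Fraction of (gold_sql, pred_sql) pairs whose execution results match.

    Simplified sketch: compares unordered result multisets on one SQLite
    database and skips the value normalization official evaluators perform.
    """
    conn = sqlite3.connect(db_path)
    correct = 0
    for gold_sql, pred_sql in pairs:
        try:
            gold = Counter(conn.execute(gold_sql).fetchall())
            pred = Counter(conn.execute(pred_sql).fetchall())
            correct += int(gold == pred)
        except sqlite3.Error:
            pass  # an unexecutable prediction counts as wrong
    conn.close()
    return correct / len(pairs) if pairs else 0.0
```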
Evaluation Setup
- Question split: split the question-SQL pairs randomly (Iyer et al., 2017).
- SQL query split: no SQL query is allowed to appear in more than one of the train, dev, and test sets (Finegan-Dollak et al., 2018); see the sketch after this list.
- Database split: databases in the test set do not appear at training time (Yu et al., 2018c).
- Others (Shaw et al., 2021) (Chang et al., 2020)
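Below is a minimal sketch of the SQL query split, assuming each example is a dict with hypothetical "question" and "sql" fields. Whole groups of examples sharing a SQL query are assigned to a single set, so paraphrases of the same query cannot leak across splits (Finegan-Dollak et al. (2018) group by anonymized SQL templates, a refinement omitted here).

```python
import random
from collections import defaultdict

def query_split(examples, ratios=(0.8, 0.1, 0.1), seed=0):
    """Split so that no SQL query appears in more than one of train/dev/test.

    Groups examples by their SQL string, then assigns whole groups to sets,
    so paraphrases of the same query never cross the split boundary.
    """
    groups = defaultdict(list)
    for ex in examples:
        groups[ex["sql"]].append(ex)
    keys = sorted(groups)
    random.Random(seed).shuffle(keys)
    n = len(keys)
    cut1, cut2 = int(ratios[0] * n), int((ratios[0] + ratios[1]) * n)
    train = [ex for k in keys[:cut1] for ex in groups[k]]
    dev = [ex for k in keys[cut1:cut2] for ex in groups[k]]
    test = [ex for k in keys[cut2:] for ex in groups[k]]
    return train, dev, test
```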
Discussion and Future Directions
- Cross-domain text-to-SQL: incorporate domain knowledge into models trained on existing datasets, and deploy such models efficiently across different domains.
- Real-world use cases:
  - Handle corrupted tables or scenarios where no table is provided.
  - Handle user inputs that differ from those in the existing datasets.
  - Help DB administrators manage DB schemas and update DB content.
  - Multi-lingual text-to-SQL.
  - DB interfaces for people with disabilities.
- Integration into a larger scope of research:
  - QA systems for DBs.
  - Dialogue systems with knowledge from DBs.
  - Exploring the inter-relation between SQL and other logical forms.
  - Generalized semantic parsing.
- Others:
  - Apply prompt learning to text-to-SQL: how to make the system robust.
  - Evaluation of the existing text-to-SQL systems.
Text-to-SQL examples
Example of the table in the database
The domain of the Restaurants dataset is restaurant information, where questions are typically about food type, restaurant location, etc.
Databases differ widely in how many tables they contain: the Restaurants database has 3 tables, while the ATIS database has 32 (Suhr et al., 2020).
Domain Knowledge
Question:
Will undergrads be okay to take 581 ?
SQL query:
SELECT DISTINCT T1.ADVISORY_REQUIREMENT , T1.ENFORCED_REQUIREMENT , T1.NAME
FROM COURSE AS T1
WHERE T1.DEPARTMENT = "EECS" AND T1.NUMBER = 581 ;
In the Advising dataset, the department “EECS” is domain knowledge: “581” in the utterance refers to the course with number 581 in the EECS department.
Dataset Convention
Question:
Give me some restaurants in alameda ?
SQL query:
SELECT T1.HOUSE_NUMBER , T2.NAME
FROM LOCATION AS T1 , RESTAURANT AS T2
WHERE T1.CITY_NAME = "alameda" AND T2.ID = T1.RESTAURANT_ID ;
In the Restaurants dataset, when the user queries “restaurants”, the corresponding SQL query returns the columns “HOUSE_NUMBER” and “NAME” by dataset convention.
Text-to-SQL Templates
An example of the template for a text-to-SQL pair used by Iyer et al. (2017) is as follows:
Question template:
Get all <ENT1>.<NAME> having <ENT2>.<COL1>.<NAME> as <ENT2>.<COL1>.<TYPE>
SQL query template:
SELECT <ENT1>.<DEF>
FROM JOIN_FROM(<ENT1>, <ENT2>)
WHERE JOIN_WHERE(<ENT1>, <ENT2>)
AND <ENT2>.<COL1> = <ENT2>.<COL1>.<TYPE> ;
Generated question:
Get all author having dataset as DATASET_TYPE
Generated SQL query:
SELECT author.authorId
FROM author , writes , paper , paperDataset , dataset
WHERE author.authorId = writes.authorId
AND writes.paperId = paper.paperId
AND paper.paperId = paperDataset.paperId
AND paperDataset.datasetId = dataset.datasetId
AND dataset.datasetName = DATASET_TYPE ;
Here, the slots in the templates are populated with table and column names from the database schema, and the corresponding tables are joined accordingly.
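Below is a toy, hypothetical sketch of this slot filling. The slot values are hand-picked to reproduce the example above, whereas a real generator enumerates entities and columns from the database schema and expands the join macros:

```python
# Toy sketch: fill the slots of Iyer et al. (2017)-style NL/SQL templates.
Q_TEMPLATE = ("Get all <ENT1>.<NAME> having <ENT2>.<COL1>.<NAME> "
              "as <ENT2>.<COL1>.<TYPE>")
SQL_TEMPLATE = ("SELECT <ENT1>.<DEF> FROM JOIN_FROM(<ENT1>, <ENT2>) "
                "WHERE JOIN_WHERE(<ENT1>, <ENT2>) "
                "AND <ENT2>.<COL1> = <ENT2>.<COL1>.<TYPE> ;")

# Longest slots first, so compound slots are filled before their prefixes.
FILLS = [
    ("<ENT2>.<COL1>.<TYPE>", "DATASET_TYPE"),  # typed value placeholder
    ("<ENT2>.<COL1>.<NAME>", "dataset"),       # column's natural-language name
    ("<ENT1>.<NAME>", "author"),               # entity's natural-language name
    ("<ENT1>.<DEF>", "author.authorId"),       # entity's default projection
    ("<ENT2>.<COL1>", "dataset.datasetName"),  # concrete column reference
    ("<ENT1>", "author"),
    ("<ENT2>", "dataset"),
]

def populate(template):
    for slot, value in FILLS:
        template = template.replace(slot, value)
    return template

print(populate(Q_TEMPLATE))
# -> Get all author having dataset as DATASET_TYPE
print(populate(SQL_TEMPLATE))
# JOIN_FROM/JOIN_WHERE remain as macros; the real generator expands them
# into the join path author -> writes -> paper -> paperDataset -> dataset.
```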
Iyer et al. (2017) also use PPDB (Ganitkevitch et al., 2013) to paraphrase the natural language questions. An example of a PPDB paraphrase pair is “thrown into jail” and “imprisoned”.
This Github Repo also holds NL-SQL templates for Academic, Advising, ATIS, Geography, IMDB, Restaurants, Scholar, and Yelp, where the corresponding entities in the NL-SQL pairs are masked.
Complexity of NL-SQL pairs
Yu et al. (2018c) define SQL hardness as the number of SQL components: a SQL query is harder when it contains more SQL keywords such as GROUP BY and nested subqueries. Here are some examples from the original paper:
Easy:
SELECT COUNT(*)
FROM cars_data
WHERE cylinders > 4 ;
Medium:
SELECT T2.name , COUNT(*)
FROM concert AS T1 JOIN stadium AS T2 ON T1.stadium_id = T2.stadium_id
GROUP BY T1.stadium_id ;
Hard:
SELECT T1.country_name
FROM countries AS T1 JOIN continents AS T2 ON T1.continent = T2.cont_id
JOIN car_makers AS T3 ON T1.country_id = T3.country
WHERE T2.continent = 'Europe'
GROUP BY T1.country_name
HAVING COUNT(*) >= 3 ;
Extra Hard:
SELECT AVG(life_expectancy)
FROM country
WHERE name NOT IN
  (SELECT T1.name
   FROM country AS T1 JOIN country_language AS T2 ON T1.code = T2.country_code
   WHERE T2.language = "English" AND T2.is_official = "T") ;
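As a rough, hypothetical illustration of counting SQL components (much coarser than the official Spider criteria), the sketch below buckets a query by keyword counts and subquery nesting; it happens to classify the four examples above consistently:

```python
import re

# Keywords whose count drives the (simplified) hardness estimate.
COMPONENTS = re.compile(
    r"\b(GROUP\s+BY|ORDER\s+BY|HAVING|JOIN|EXCEPT|UNION|INTERSECT|LIMIT)\b",
    re.IGNORECASE,
)

def sql_hardness(sql):
    """Bucket a SQL query by a crude component count (not the official rule)."""
    if re.search(r"\(\s*SELECT\b", sql, re.IGNORECASE):
        return "extra hard"          # any nested subquery
    n = len(COMPONENTS.findall(sql))
    if n == 0:
        return "easy"                # bare SELECT-FROM-WHERE
    return "medium" if n <= 2 else "hard"

print(sql_hardness("SELECT COUNT(*) FROM cars_data WHERE cylinders > 4 ;"))
# -> easy
```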
There is no established measure of how hard the natural language is. Intuitively, model performance can decrease when faced with longer questions from users. However, the information conveyed in longer sentences can be more complete, while shorter sentences can be ambiguous. Besides, domain-specific phrases can confuse the model in both short and long utterances (Suhr et al., 2020). Thus, researchers need to consider various perspectives to determine the complexity of a natural utterance.
Acknowledgements