Any interest in Seq 2 SQL?

Natural Language to SQL

I’m interested in creating a model for natural language (English) to SQL translation.

Motivation:

Imagine being able to take any SQL database or CSV (you can use SQLite to turn a CSV into a queryable database) and ask it questions in plain English and get useful answers (in the form of query results).
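For example, the CSV-to-SQLite step is only a few lines with pandas; the file and table names below are made-up placeholders:

```python
# Sketch: turning a CSV into a queryable SQLite database
# ("employees.csv" and the "employees" table are illustrative only).
import sqlite3
import pandas as pd

df = pd.read_csv("employees.csv")            # any CSV you want to query
conn = sqlite3.connect("employees.db")       # creates the file if needed
df.to_sql("employees", conn, if_exists="replace", index=False)

print(pd.read_sql("SELECT * FROM employees LIMIT 5", conn))
```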

Problem:

When querying databases, you need to be familiar with the database schema in order to write useful queries, in addition to having experience writing SQL (it's a different way of thinking).

How the problem can be constrained:
  • The vocab (and thus output) will only ever be SQL keywords and unique members of the schema (column names, table names, views, database names, etc.).
  • The input can be constrained to only ever be a question.
  • The solution can be tested for syntactic validity (you can run the query and see if it returns output).
  • The problem can first require exact column / table names in the input, and then be expanded to look for similar names using a combination of WordNet and fuzzy search (see the sketch after this list).
  • There is a lot of depth to the problem, and even solving subsets can be very useful (e.g. first only outputting SELECT _ FROM _, then allowing WHERE clauses, then aggregations, then basic functions such as sorting, then joins and nesting, etc.).
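Here's a rough, hypothetical sketch of that schema-name matching idea, combining WordNet synonyms (via NLTK, assuming the wordnet corpus is downloaded) with simple fuzzy matching; the example schema and column names are made up:

```python
# Hypothetical sketch: matching a user-supplied word to schema column names
# using WordNet synonyms plus difflib fuzzy matching.
from difflib import get_close_matches
from nltk.corpus import wordnet  # requires nltk.download('wordnet') beforehand

schema_columns = ["salary", "employee_name", "hire_date"]  # made-up example schema

def candidate_columns(word, columns, cutoff=0.6):
    # Expand the query word with WordNet synonyms (lemma names).
    synonyms = {word}
    for synset in wordnet.synsets(word):
        synonyms.update(lemma.name() for lemma in synset.lemmas())
    # Fuzzy-match every candidate against the actual column names.
    matches = set()
    for candidate in synonyms:
        matches.update(get_close_matches(candidate, columns, n=3, cutoff=cutoff))
    return matches

print(candidate_columns("wage", schema_columns))   # hopefully picks up "salary" via a synonym
print(candidate_columns("salry", schema_columns))  # a typo still fuzzy-matches
```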
There really isn’t a very good existing model:

The papers I’ve skimmed really aren’t that impressive. They basically just do bidirectional RNNs + teacher forcing + attention (and they use word2vec and GloVe embeddings). We know how to do that.

I think the biggest limiting factor is the quality of the dataset. WikiSQL is a great step in the right direction and I believe will be the dataset to use, in addition to WordNet.

Any interest?

Let me know if anyone is interested in working on this problem!

Resources

Datasets (GitHub Repos):

WikiSQL: English to SQL + Schema
SENLIDB: English to SQL

Papers / Models:

Neural Translation

Seq2SQL -> Review- it was rejected
SQLNet -> Pytorch Implementation

Other

NLIDB -> Implementation
List of Popular Related Papers

9 Likes

One important point here is that generating text that has a restricted form is pretty easy to do, but no-one seems to have documented it yet. Would be cool to write a little post showing an example of this (could be text->sql, or even just a “random sql generator” to start with).

Basically, create your language model or seq2seq or whatever. Then at inference time, have some kind of state machine or other rules-based generator logic, and at each time step, sample from the RNN, and pick the top-ranked predictions limited by those that are allowed by the state machine.

You can combine this with beam search to get better results.
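To make that concrete, here's a minimal, hypothetical sketch of the idea with a toy grammar and a stand-in for the trained model; the vocabulary, state machine, and random-logits "model" below are all made up for illustration:

```python
# Hypothetical sketch of constrained decoding: at each step the model scores
# the whole vocabulary, but tokens the grammar's state machine does not allow
# are masked out before picking the top remaining prediction.
import torch

vocab = ["SELECT", "*", "FROM", "employees", "departments", "<eos>"]
stoi = {w: i for i, w in enumerate(vocab)}

# Toy state machine: current state -> (allowed tokens, next state).
transitions = {
    "start":        (["SELECT"], "after_select"),
    "after_select": (["*"], "after_cols"),
    "after_cols":   (["FROM"], "after_from"),
    "after_from":   (["employees", "departments"], "done"),
    "done":         (["<eos>"], "done"),
}

def constrained_decode(model_logits_fn, max_len=10):
    state, tokens = "start", []
    for _ in range(max_len):
        allowed, next_state = transitions[state]
        logits = model_logits_fn(tokens)                 # scores over the vocab
        mask = torch.full_like(logits, float("-inf"))
        for tok in allowed:
            mask[stoi[tok]] = 0.0                        # only legal tokens survive
        best = vocab[int(torch.argmax(logits + mask))]
        tokens.append(best)
        state = next_state
        if best == "<eos>":
            break
    return tokens

# Stand-in for a trained LSTM: random scores over the vocabulary.
print(constrained_decode(lambda toks: torch.randn(len(vocab))))
```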

6 Likes

To make sure I understand correctly:

When you say random SQL generator, I assume that means SELECT x, y FROM z WHERE x = a AND y = b, where x, y, z, a, b (all the tokens apart from keywords) and the filter are completely random? (i.e. constrained only by syntactic validity?) Later z could be something other than a table, such as the result of another query, but let’s ignore that for now.

First of all, couldn’t we only generate predictions which are already allowed? What’s the drawback to that approach?

Sampling from the RNN…

Been reading this to understand beam search for RNNs: Part 1 of ‘Image to LaTeX’

There would be (e.g.) picking columns, picking tables, picking filters, all of which rely on one another.

Say you generate SELECT ... and the highest-probability options are foo or bar (using k = 2). Now when you’re generating FROM ... you are constrained to choosing only a table which contains that column, so you end up generating two queries, one for foo and one for bar, with their respective tables (of which there could be more than one, so you choose those with the highest probability). Then you do the same thing with the WHERE clause.

But what if the originally chosen, highest probability columns were incorrect?
It seems as though columns and tables should be chosen independently, possibly influenced by one another?

Filters on the other hand are entirely dependent on the chosen columns.

This creates a tree of possible queries, from which the one nearest to the ground truth is chosen. I suppose this is the heart of the problem: what defines the best solution? What’s the loss function?

Loss Function

Queries could have too little information, too much information, incorrect columns, incorrect tables, incorrect filters, etc. Would you assume that these pieces of the loss function are tunable hyperparameters?

Other questions that come to mind…

How important is the size of the results? Should aggregations like COUNT and AVG be taken into account in the loss function? And how useful would that information be?

1 Like

One of Bengio’s students, Joseph Turian, did “Upshot” which was a Seq2SQL company that got acquired in 2015. Might want to see if he published, or if now that his handcuffs are off, he’d talk about it.

3 Likes

Nope. I mean a standard LSTM generator. But at each time step check that the proposed token is legal based on the grammar (which could be implemented as a state machine, for instance).

2 Likes

Thanks for the time you took to provide your input- it’s much appreciated.

That’s a really interesting idea. I’m definitely interested in working on an implementation of this. One thing I’m thinking is that in order to know how everything can hook together, it might be useful to have table and column metadata for whatever database the English-to-SQL model is working on.
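If the database is SQLite, that metadata is easy to pull out; a quick sketch (the database file name is just a placeholder):

```python
# Sketch: pulling table and column metadata out of a SQLite database so the
# decoder's vocabulary can be restricted to names that actually exist.
import sqlite3

conn = sqlite3.connect("employees.db")   # placeholder database file
tables = [row[0] for row in
          conn.execute("SELECT name FROM sqlite_master WHERE type='table'")]

schema = {}
for table in tables:
    # PRAGMA table_info returns one row per column: (cid, name, type, ...)
    schema[table] = [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]

print(schema)   # e.g. {'employees': ['name', 'age', 'salary']}
```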

1 Like

This reminds me of the scientific paper generator that you went over at some point in the past (can’t remember the name, but it started with an L, I think).

Sounds interesting, will read through it later today.

Sorry for a stupid question, but can you clarify what you mean by a “state machine”?

https://blog.markshead.com/869/state-machines-computer-science/

3 Likes

So… the simplest approach would be:

  1. Train a recurrent network on SQL query text to predict the next word.
  2. Create some logic to check whether a query is valid or not, i.e. FROM can only come after SELECT SOMETHING.
  3. Just feed the valid prediction back to the network and generate the next word?

For each token generated:

  1. generate list of probabilities of every word in vocab
  2. filter that list down to only those that are legal SQL at this position
  3. pick the best of that list.

Optionally, improve the above using beam search.
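A rough sketch of those three steps with a small beam search on top; `next_word_probs` and `is_legal` are placeholders for whatever trained model and grammar check you end up with:

```python
# Hypothetical beam-search version of the generate -> filter -> pick loop.
# next_word_probs(prefix): your trained RNN's distribution over the vocab
#                          given the tokens so far (here a dict word -> prob).
# is_legal(prefix, word):  whatever grammar / state-machine check you implement.
import math

def beam_search(next_word_probs, is_legal, vocab, beam_width=3, max_len=20):
    beams = [([], 0.0)]                              # (token list, log-probability)
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            if prefix and prefix[-1] == "<eos>":
                candidates.append((prefix, score))   # finished beams carry over
                continue
            probs = next_word_probs(prefix)          # step 1: score the whole vocab
            for word in vocab:
                if not is_legal(prefix, word):       # step 2: drop illegal SQL tokens
                    continue
                p = probs.get(word, 1e-12)
                candidates.append((prefix + [word], score + math.log(p)))
        # step 3: keep only the best `beam_width` partial queries
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        if all(b[0] and b[0][-1] == "<eos>" for b in beams):
            break
    return beams[0][0]
```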

2 Likes

An interesting paper for ya’all:

2 Likes

@jsonm any progress on this one? :slight_smile:

The dataset I tried using first (SENLIDB) is pretty messy: lots of titles are completely useless (like random characters) and many descriptions are blank.

I started on the method Jeremy discussed, generating random SQL, and will try to put some more time in.

@KevinB has put in some work separately attempting SQL -> English.

But I definitely need to spend more time- honestly I feel like I’m playing catch up with the course material so I’d like to spend time on GANs and am putting this on hold temporarily (I work full time so time is tight).

Though there’s a paper that discusses using the NLP techniques we’ve been learning, so that’d be a pretty cool application (SQL -> English -> SQL).

@nok if you’re worried about things already being completed- don’t! This will take time and probably won’t get the attention it deserves for a bit. We keep learning so much stuff! I guess it’s hard keeping up and actually intuitively understanding everything.

I’m using the same dataset and starting from generating random SQL with an RNN.

My current approach is simple, basically just something similar to lesson 6 (part 1): an RNN generating random words… I haven’t figured out how to build a state machine or similar to make sure the SQL syntax is valid, though. Any thoughts?

I am also working full-time, so I take this project as an exercise to make sure I have understood the concepts. I am also lagging slightly behind; I haven’t revised lessons 11 and 12.

My current goal is to implement a translation model that uses two monolingual corpora (starting with English and Spanish) and then, using that skeleton, build the same thing with SQL as one of the corpora. I honestly have no idea how well it will work, but it could allow us to build everything without requiring a corpus that contains both SQL and English.

It can be simplified to start: something like the entry point to your state machine is “select” (i.e. check to see if it starts with “select”; in reality it could be other things like “with”).

Then it can be “*” or a comma-delimited list of columns (that exist in the schema, but you could leave that for later).

Then it can only be “from”.

Then it can be a comma-delimited list of table names, optionally each followed by an “as” alias.

Then things get a little complicated if you’re incorporating joins, otherwise just handle filters (“where” clause delimited by “and” or “or”).

There are also aggregation functions, group by and order by, etc.

Honestly I would probably just use regex to start- you could improve it later to be able to select columns from the result of a select statement too etc.
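For what it’s worth, here’s a rough sketch of what the regex route could look like for only the simplest form; the pattern below is illustrative and deliberately ignores joins, subqueries, quoting, functions, and most of real SQL:

```python
# Rough regex validity check for the simplest queries only:
#   SELECT <cols or *> FROM <tables> [WHERE <conditions>]
import re

SIMPLE_SELECT = re.compile(
    r"^\s*select\s+(\*|\w+(\s*,\s*\w+)*)"                        # * or comma-separated columns
    r"\s+from\s+\w+(\s+as\s+\w+)?(\s*,\s*\w+(\s+as\s+\w+)?)*"    # tables, optional aliases
    r"(\s+where\s+.+)?\s*;?\s*$",                                # optional, unvalidated WHERE clause
    re.IGNORECASE,
)

def looks_like_simple_select(query):
    return SIMPLE_SELECT.match(query) is not None

print(looks_like_simple_select("SELECT name, age FROM users WHERE age > 30"))  # True
print(looks_like_simple_select("SELECT FROM users"))                           # False
```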

It might be easiest to find a python SQL linter- if a good one exists…

A simple SELECT X FROM TABLE WHERE X > Y would be relatively straightforward; regex can probably do the job. But as you said, complexity adds up quickly if JOINs and the schema have to be considered.

A Python validator may be a good option; I will look into it. But I guess I will spend time catching up on the lecture material tomorrow.

This is the result I got so far… I will have to fix up the tokenization; it seems spaces are missing somehow.