Automation of SQL Query Validation

16. August 2024

The modern way of work shifts more and more to using large language models (LLM) such as OpenAI’s GPT-4 or Meta’s Llama 3. We incorporate the help of those assistants into our daily routines. This doesn’t just cover plain text generation but also helps with structured output in coding tasks.

When it comes to developing systems of any kind based on data stored in a database, it is hard to stay away from Structured Query Language (SQL). SQL is the go-to language for data manipulation and is used in various different dialects such as MySQL, PostgreSQL or Oracle. Every dialect has its nuances and differences from others. This makes the task of validating SQL-queries not as easy as it may seem.

When we, as developers, create SQL-queries, we mostly know the underlying database and the task that we want to fulfill with a SQL query. We have a rough plan in mind of what we need to do to get the output we want. When using LLMs in SQL query generation, we want to get the same results we would come up with, but much faster and out of the box. This leads to some challenges we need to overcome. Let’s delve into some common issues that arise when generating SQL statements.

1) Understanding Context and Schema

One of the most crucial but challenging tasks in generating SQL queries is understanding the context and schema of the database. Large Language Models are trained on vast amounts of text but don't have specific information about a random database's schema, tables and relationships. In the same way as we as developers need to understand the database we want to query data from, an LLM needs the same information of what tables and columns to get specific data from. The quality of the output depends on how the schema is presented to the LLM. There is also the possibility to finetune an LLM on specific schemas, but that holds the risk of getting too static and generating invalid output again.

2) Language Ambiguity

Natural languages are inherently ambiguous and context-sensitive. Different users may articulate the same use case in various ways, leading to multiple interpretations. This makes it harder to get a deterministic, perfect outcome for each request given to the LLM. The input might also differ in the proper explanation of the task. This means that a task might not be properly explained, or the connection to the database to be queried cannot be textually presented. This also affects building the context needed to generate the query output.

3) Hallucination

LLMs are trained on a vast variety of text and examples. One model will produce better output for SQL query generation, another one will perform worse. For most LLMs, we don’t know what they were trained with, or even if we know, we hardly keep track of the huge amount of data presented to the models. Nowadays, every single LLM we are using will be able to generate SQL query output. But with this power also come some risks. In the query outcome, there may be references to columns that do not exist in our specific use case. This is a classic example of LLMs hallucinating, preventing us from querying our database for the desired data.

4) Syntax Errors

Although LLMs know how to produce SQL queries, it is still possible that syntactical errors are created. This problem might increase, as soon as different SQL dialects are asked to be used. Depending on the LLM used, some dialects might perform better than others. Some might not even be considered, although they are referenced in the input prompt. But if an LLM has a bias to a specific SQL dialect to use, it will tend to generate output using that structure for the SQL output.

5) Logical Errors

Even if the query is syntactically correct, it might not yield the desired results due to logical inconsistencies. For example, incorrect join conditions or where clauses. In the end, the user doesn’t get the output that is expected for the use case, although the query can be executed without any errors.

6) Performance and Efficiency

LLMs may produce syntactically correct SQL queries that are, however, not optimized for the performance. Poorly optimized queries can lead to long execution times and heavy loads on database servers. Ensuring that generated queries are optimized requires additional layers of validation and optimization.

7) Maintenance

If a user follows the path of fine-tuning specific use cases for a given data structure, the maintenance of LLMs for adapting to new patterns, languages and changes can be labor- and cost-intensive. Ensuring the model evolves with changing user requirements and keeps in line with the current database is an ongoing challenge. When it comes to continuous development for fine-tuning, a proper pipeline is needed.

Given the challenges, automating the validation of created queries is very practical/helpful. This is where the Python libraries SQLFluff and SQLGlot come into play. They can help you to validate the generated SQL output in a few easy steps.

At first, let’s take a look at the capabilities of the two mentioned libraries:

SQLFluff

SQLFluff is a dialect-flexible, configurable SQL linter. It is like a spell checker for SQL queries. It examines SQL code to ensure it follows predefined best practices and standards. It is even capable of auto-fixing some linting errors.

SQLGlot

SQLGlot is more like a multilingual translator for SQL. It reads SQL queries and can convert them between different SQL dialects. It parses SQL queries to an abstract level that allows it to easily transpile queries into different dialects. This becomes especially handy for managing database systems with multiple dialects, each with its own syntax nuances.

Using SQLFluff to validate generated SQL-queries

1) Installation and Setup
Just like any software, SQLFluff has to be added to your project. This is straightforward and done in a simple install command.
2) Configuration
SQLFluff supports various SQL dialects, so in this step, we define which one should be checked for. Also, we can add and exclude rules that we want to check for when checking a SQL query.
3) Automatic Linting
Once all is configured, we need to feed the generated SQL query to the SQLFluff linter. It will review the query and identify any issues.
4) Auto-Fix
With the auto-repair function that SQLFluff provides, we can fix possible issues right away in one easy step.
5) Continuous Quality Control
Steps 3 and 4 can be set up to check and auto-fix each and every SQL query right after they are generated, so we have direct feedback if a query is valid or not.

Using SQLGlot to validate SQL-queries

1) Installation
Similar to SQLFluff, we also need to install SQLGlot and add it to our project.
2) Parsing SQL query
In the next step, we feed the generated query into SQLGlot. It parses the queries and also checks if they are syntactically correct. We don’t get the in-depth overview of a linter like SQLFluff though.
3) Transpile
If we need to support various dialects in our database system, we can use this step to convert the generated query into the dialects that we need. This step can also be used if we generate queries for a specific dialect, because our LLM performs significantly better for the given dialect.
4) Extract referenced database tables and fields
In SQL queries, we access specific fields of a database. In complex queries, we might even access several tables to get the data we want. When using LLMs to generate SQL queries, there is no guarantee that the used fields really exist in the database. With SQLGlot we can identify the tables and fields that are tried to be accessed via the query. Using a simple cross-check of the structure of our database we can easily find out if the used fields exist in our database or not.
5) Automated Workflow Integration
Just like SQLFluff, we can integrate SQLGlot into our continuous integration pipelines, ensuring that our generated SQL queries are compatible with the intended database environment.

Advantages of Using SQLFluff and SQLGlot for LLM-Generated SQL

- Error Minimization:
These tools help catch and correct errors that escape the initial LLM generation process
- Consistency and Standards:
Enforcing coding standards across all SQL queries ensures maintainability and readability
- Time Efficiency:
Automation saves valuable development time, allowing developers to focus on more complex tasks
- Increase Reliability:
Continuous validation leads to more reliable SQL queries, reducing the risk of bugs in production
- Cross-Database Flexibility:
SQLGlot’s transpiling capabilities make it easy to adapt queries for different database systems, adding a layer of flexibility to your operations.

Conclusion

Large Language Models like GPT-4 or Llama 3 offer a powerful way to generate SQL queries quickly. However, to ensure these queries are efficient, accurate and maintainable, incorporating tools like SQLFluff and SQLGlot is essential. By validating and optimizing LLM-generated SQL, these tools provide a crucial safety net, enhancing the reliability and performance of our database interactions. By integrating SQLFluff and SQLGlot in our workflow we can bring out the best in our LLM-generated SQL queries.

About APICHAP

At APICHAP, we are dedicated to not only helping you create APIs quickly but also ensuring that your system remains safe and efficient over time. Our platform offers a complete API lifecycle management service that keeps your system up-to-date and free of technical debt. With features such as auto-monitoring and self-repair, we guarantee that your system remains secure, saving you both time and money.

Instead of incurring significant expenses at irregular intervals for system maintenance, APICHAP offers a convenient monthly license fee. This fee covers all application updates and ongoing monitoring, ensuring your system is always running optimally. Additionally, updating your integrations or building new features is included in your monthly subscription, providing you with comprehensive support and continuous improvements without unexpected costs.

By choosing APICHAP, you can focus on your core business activities while we handle the complexities of API maintenance and updates, ensuring your applications are always performing at their best.

The Author

Michael Weißenböck
CTO & Co-Founder @ APICHAP
LinkedIn

Bio:
Michael is a passionate AI enthusiast who has dedicated his career to exploring and advancing the field of artificial intelligence. With a strong research background, he has successfully completed numerous AI-related projects, consistently driven by a desire to create real-world applications that improve people's lives.