Generative AI is top of mind for many engineers, yet the question of how it can be applied to solve business problems and boost productivity is still open. Recently, I wrote a blog post about the impact of AI on platform engineers, where I talked about various AI Agents and how they can interact with different systems and APIs. The belief is that AI Agents, or combinations of them, will heavily change how we work with computers and systems today.
One of the examples I want to look into today is a Database AI Agent. We may call it a text-to-SQL model: it translates natural-language requests into SQL and produces human-readable responses.
We will use a combination of open source tools:
- Langchain – An open source Python library that offers developers a comprehensive set of resources to develop applications that run on Large Language Models (LLMs).
- Percona Distribution for PostgreSQL – We will use the Distribution to host the database with sample data. I will use Pagila, the famous Sakila example database ported to PostgreSQL. The structure of the database can be found here.
All the scripts from this blog post are available in this GitHub repository.
This blog post series has two parts:
- Open Source AI Database Agent Part 1: Introduction (this one) – Covers the basics; we create a simple database AI agent.
- Open Source AI Database Agent Part 2: OSADA – We fine-tune the scripts, expand their capabilities, and add a graphical UI.
The setup
Provision the database
Install Percona Distribution for PostgreSQL
I’m running Ubuntu, so I will follow our Quickstart guide for apt repositories.
- Install percona-release
```
wget https://repo.percona.com/apt/percona-release_latest.$(lsb_release -sc)_all.deb
sudo dpkg -i percona-release_latest.$(lsb_release -sc)_all.deb
sudo apt update
```
- Set up the apt repo for PostgreSQL 16
```
sudo percona-release setup ppg-16
```
- Install the distribution
```
sudo apt install percona-ppg-server-16
```
Bootstrap the database
Get the Pagila database SQL files from GitHub:
```
git clone https://github.com/devrimgunduz/pagila
```
Create the database:
```
sudo su postgres
psql
CREATE DATABASE pagila;
\q
```
Create the schema:
```
cat pagila/pagila-schema.sql | psql -U postgres -d pagila
```
Insert the data:
```
cat pagila/pagila-data.sql | psql -U postgres -d pagila
```
I will also create the user pagila to access the database from my scripts:
```
sudo su postgres
psql
create user pagila with password 'SOMEPASSWORD';
\c pagila
GRANT ALL PRIVILEGES ON DATABASE pagila to pagila;
grant select on all tables in schema public to pagila;
\q
```
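To confirm the new user can read the sample data, a quick sanity check can help (depending on your pg_hba.conf authentication settings, psql may prompt for the password):

```
psql -U pagila -d pagila -h localhost -c "SELECT count(*) FROM film;"
```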
Setting up Langchain
With Langchain, any LLM can be used. In our examples, we will use Google Gemini models, which require an API key. You can also use OpenAI in the next iteration of the agent; read more in part two of this series.
Step 0 – Make sure Gemini works
Start by installing the required Python packages. The usual recommendation is to use pip:
```
pip install --upgrade --quiet langchain langchain-google-genai pillow
```
A simple script, 0_langchain_gemini_test.py, asks Gemini, “What can you tell me about Percona?” You can set the GOOGLE_API_KEY environment variable; if it is not set, the script will prompt for it.
```python
import getpass
import os

from langchain_google_genai import ChatGoogleGenerativeAI

if "GOOGLE_API_KEY" not in os.environ:
    os.environ["GOOGLE_API_KEY"] = getpass.getpass("Provide your Google API Key")

llm = ChatGoogleGenerativeAI(model="gemini-pro")

result = llm.invoke("What can you tell me about Percona?")
print(result.content)
```
```
$ python3 0_langchain_gemini_test.py
**Percona**

**About:**

* Percona is a leading provider of open source database software and services for MySQL, PostgreSQL, MongoDB, and MariaDB.
* It was founded in 2006 by former MySQL engineers involved in the development and support of the open source MySQL database.
* Percona's mission is to provide reliable, scalable, and high-performance database solutions for businesses and organizations worldwide.
...
```
It is working.
Query the database
We are done with the initial setup; let’s query the database.
Install the required libraries:
```
pip install --upgrade --quiet langchain-community psycopg2-binary
```
To experiment with querying the database, we use the 1_langchain_gemini_postgresql.py script. The script takes a user question in human-readable form as an argument and produces the response.
To do that, we will use the create_sql_agent function, which constructs an agent specifically designed to interact with SQL databases. This agent combines a language model (like Gemini) with specialized tools that understand and manipulate SQL queries.
In addition to GOOGLE_API_KEY, we need to provide details about the database:
```python
dbname = "pagila"
username = os.environ["PG_USERNAME"]
password = os.environ["PG_PASSWORD"]
hostname = "localhost"
port = "5432"
```
Make sure that you set PG_USERNAME and PG_PASSWORD as environment variables.
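To show how the pieces fit together, here is a minimal sketch of such a script. It is an approximation, not the exact code from the repository; the SQLDatabase.from_uri connection string and the db= argument of create_sql_agent are assumptions based on current langchain-community APIs:

```python
import os
import sys

from langchain_community.agent_toolkits import create_sql_agent
from langchain_community.utilities import SQLDatabase
from langchain_google_genai import ChatGoogleGenerativeAI

# Database connection details; PG_USERNAME and PG_PASSWORD come from the environment
dbname = "pagila"
username = os.environ["PG_USERNAME"]
password = os.environ["PG_PASSWORD"]
hostname = "localhost"
port = "5432"

db = SQLDatabase.from_uri(
    f"postgresql+psycopg2://{username}:{password}@{hostname}:{port}/{dbname}"
)

llm = ChatGoogleGenerativeAI(model="gemini-pro", temperature=0)

# create_sql_agent wires the LLM to SQL tools: schema inspection,
# query checking, and query execution
agent_executor = create_sql_agent(llm=llm, db=db, verbose=True)

# The question comes in as the first command-line argument
print(agent_executor.invoke(sys.argv[1]))
```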
Let’s start with a simple question:
```
$ python3 1_langchain_gemini_postgresql.py "How many films are there in the database?"

> Entering new SQL Agent Executor chain...
I should find the number of rows in the film table
Action: sql_db_query
Action Input: SELECT COUNT(*) FROM film
[(1000,)]
I now know the final answer
Final Answer: 1000

> Finished chain.
{'input': 'How many films are there in the database?', 'output': '1000'}
```
Without providing any additional context or explaining the structure of the database, I was able to get a proper answer. The way it works is the following: the agent first inspects the database to discover its tables and schema, then generates a SQL query for the question, executes it, and turns the result into a human-readable answer.
But this works only for simple questions, where the query is quite straightforward. For more complex questions, multiple requests to the database and the LLM are required, so-called chains. It is also quite interesting that chains help to fix some errors in the query on the fly. Below, you can see how the text-to-SQL script worked for two more questions:
- Which 5 actors have appeared in the most films?
- What are the most popular film categories? Show the top 3 with the number of films in that category.
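These run the same way as the first example; only the question argument changes:

```
$ python3 1_langchain_gemini_postgresql.py "Which 5 actors have appeared in the most films?"
$ python3 1_langchain_gemini_postgresql.py "What are the most popular film categories? Show the top 3 with the number of films in that category."
```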
Known issues
Sometimes, the Large Language Model can’t fix an error in the SQL query, and the agent gets stuck in a loop. This can happen even for a question that was successfully answered before. When it hits the loop, the agent will notify you about it:
````
...
[SQL: ```sql
SELECT actor_name, COUNT(*) AS num_films
FROM film_actor
GROUP BY actor_name
ORDER BY num_films DESC
LIMIT 5;
```]
(Background on this error at: https://sqlalche.me/e/20/f405)The query is still not enclosed in triple backticks.
Action: sql_db_query
Action Input: ```sql
SELECT actor_name, COUNT(*) AS num_films
FROM film_actor
GROUP BY actor_name
ORDER BY num_films DESC
LIMIT 5;
```
Error: (psycopg2.errors.SyntaxError) syntax error at or near "```"
LINE 1: ```sql
        ^

[SQL: ```sql
SELECT actor_name, COUNT(*) AS num_films
FROM film_actor
GROUP BY actor_name
ORDER BY num_films DESC
LIMIT 5;
```]
(Background on this error at: https://sqlalche.me/e/20/f405)

> Finished chain.
{'input': 'Which 5 actors have the most films they appeared in? Show names of actors and the corresponding number of films they appeared in.', 'output': 'Agent stopped due to iteration limit or time limit.'}
````
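Since the agent reports the loop through its output string, one pragmatic workaround is to retry the question a few times before giving up. The helper below is a hypothetical sketch, assuming the agent_executor object from the script above; the stop message matches what the agent printed in the log:

```python
STOP_MESSAGE = "Agent stopped due to iteration limit or time limit."

def ask_with_retries(agent_executor, question, retries=3):
    """Re-run the agent when it gets stuck in a loop and gives up."""
    for attempt in range(retries):
        result = agent_executor.invoke(question)
        if result["output"] != STOP_MESSAGE:
            return result["output"]
    # All attempts looped; surface the failure to the caller
    raise RuntimeError(f"Agent failed to answer after {retries} attempts")
```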
The potential impact
Companies generate tons of data that they later use to make better business decisions. A perfectly structured BI system requires the skills of analysts and engineers. In this article, we saw how generative AI can help digest data even from a raw, rudimentary database, and the expectation is that this tooling will mature over time. There will be no need to learn SQL or understand how data is stored; AI will make sense of it automagically. This is going to impact how businesses store, manage, and process data.
In the second part of the blog post, we productize the script:
- Add a UI and a Docker container for easier onboarding
- Add MySQL to the supported databases
- Add OpenAI and custom LLMs
Check out Open Source AI Database Agent Part 2: OSADA!