Automatic Index Recommendations in PostgreSQL using pg_qualstats and hypopg

PostgreSQLIn my previous blog post, we have seen how the extension hypopg can be helpful in creating hypothetical indexes in PostgreSQL. If you have read it already, you know that it is very useful in verifying whether an Index can improve the performance of an SQL without having to create it in reality. Considering this, is there also a way to automatically suggest which indexes can improve the performance of some of your SQL’s? The answer is… YES!

In this blog post, we will see how we can get a direct recommendation to improve a specific query, as seen in the following snippet.

Without any further delay, let’s discuss the extension pg_qualstats which enables us to achieve this requirement for PostgreSQL versions 9.4 or later. Following that, we will take a look at a logic which could automatically suggest what indexes would be helpful for query optimizations – without much manual work.

All of the logic discussed in this blog post is reproducible, so please feel free to do so using the commands and try tuning your custom queries in test environments.

pg_qualstats

pg_qualstats is an extension developed by the POWA Team to uncover the need for storing predicates (quals). It stores the predicates found in WHERE clauses and JOIN conditions. This helps us analyze a query execution and opens up the possibility of automatic query optimizations.

When we query pg_stat_statements like extensions, we only see the prepared SQL or the SQL without any bind variable values. As you cannot perform an EXPLAIN (to see the execution plan of a Query) without the quals, the only option available is to see if that query was logged in the PostgreSQL log file and then identify the parameters passed to it. Or you can maybe use some arbitrary values, but this requires some manual intervention and time. But, when you create this extension: pg_qualstats, it stores queries along with the actual quals based on the sample rate ( pg_qualstats.sample_rate) specified.

Creating the extension: pg_qualstats

For RedHat/CentOS, we can install it using the packages available in the PGDG repository. Once you have added the PGDG repo, simply run the following command:

Similarly for Ubuntu/Debian:

Once installed, you must add pg_qualstats to shared_preload_libraries. This requires a restart. As I am also using pg_stat_statements to get the queryid associated with each query recorded by pg_qualstats, I have the following setting in my postgresql.conf file:

Modifying the above parameter requires a restart of your PostgreSQL instance.

Some GUCs you should know

In order to start taking advantage of this extension, you may have to set some of the GUCs (Grand Unified Configuration) in your PostgreSQL server. These can be set using ALTER SYSTEM or by manually adding the associated entries in your postgresql.conf or postgresql.auto.conf files.

pg_qualstats.enabled: true or false (to enable or to disable pg_qualstats). ON by default.

pg_qualstats.track_constants: true or false (to enable tracking of each constant. False would reduce the number of entries to track predicates.)

pg_qualstats.max: The number of queries tracked. Defaults to 1000.

pg_qualstats.resolve_oids: Just store the oids or resolve them and store at query time. This takes additional space.

pg_qualstats.track_pg_catalog: Defaults to false. Whether or not the predicates of the objects in pg_catalog schema should be computed.

pg_qualstats.sample_rate: Default is -1. The fraction of queries to be sampled. -1 defaults to (1/max_connections). When set to 1, everything is sampled. Similarly when set to 0.1, one out of 10 queries are sampled.

Quals and Query Examples

In order to see this in action, let us use sysbench-tpcc to generate some SQL traffic, and then see some of the details captured.

Before running sysbench-tpcc, I have created all the required extensions as seen in the following log. In order to see the queryid (same as the queryid column of pg_stat_statements) associated with each qual captured, it is important to have the extension: pg_stat_statements created. Similarly, to create hypothetical indexes, we need to have the extension: hypopg created.

For this test, I have set  pg_qualstats.sample_rate to 1. Setting this to 1 captured every qual. We can, of course, reduce the sampling to ensure not everything is captured.

Installing and running sysbench-tpcc on RedHat/CentOS

After running the benchmark for 10 seconds, we are now ready to query the view: pg_qualstats_indexes that gives us a nice view of the columns on which there are no indexes.

Automatic Index Recommendations

In the previous section, we have seen the columns (of tables used in some of the SQLs in WHERE condition) which do not have indexes on them. Indexes on these columns can be considered as recommended indexes. To make this more meaningful, I have written the following function to store the queries that are using these columns as predicates, along with their execution plans before and after creating the recommended index. Using this data, we can understand whether the recommended index is really helpful.

The above function uses the following logic:

  1. Create a Table with name: public.idx_recommendations where the results are stored. It stores the queries on which the table and column names mentioned in the output of pg_qualstats_indexes are used as predicates, along with their execution plan before and after creating the hypothethical indexes.
  2. Get the list of Queries (candidates for query tuning) along with their queryid and the attributes on which an index is recommended for each query. The SQL in the above FOR LOOP is built using a slight modification to the existing view: pg_qualstats_indexes.
  3. An example query with predicates can be obtained using the function : pg_qualstats_example_query() provided by pg_qualstats.
  4. Run EXPLAIN on the example query to store it in the table: public.idx_recommendations.
  5. Use hypopg to create a hypothetical index on the attributes mentioned as columns without an index on them.
  6. Run EXPLAIN on the query again and use hypopg_reset() to drop the hypothetical index created.

When I have validated if any of the queries generated by sysbench-tpcc need some tuning using indexing, it was not a surprise that none of those need any further indexing. Hence, for the purpose of a demo, I have created a table and ran a few queries as following.

Now, let us check if we can find the indexes that could improve the SQL COST of execution in reality. As I have mentioned earlier, none of the queries run by sysbench-tpcc needed any further improvement through indexing, so we see the improvement only for the 2 select statements I ran above.

As this function is storing the results into a table: public.idx_recommendations, we can query that and see the hypothetical index that has improved the total cost of that query.

Conclusion

With this experiment, we see that we can use hypopg and pg_qualstats  to automate index recommendations. The automation logic is currently limited to B-Tree Indexes only. Though it has a very negligible impact on performance through some minimalistic resource consumption, it can be considered by developers while coding an application logic. Developers could easily enable sampling for each query and see what indexes can be used to improve which query, and then implement the changes in Production. The function logic I have created above is just an experiment in automatic index recommendations and you may re-use the same upon additional testing. Special thanks again to POWA Team who have contributed to the very useful extension pg_qualstats in PostgreSQL.

Let’s discuss this approach on Hacker News

Share this post

Comments (7)

  • Ramesh Reply

    Wow.. this is excellent.. thanks Percona and Avinash.
    Keep doing the great work

    July 22, 2019 at 1:09 pm
  • James Finnerty Reply

    Adding a secondary index on a frequently-updated column that was not previously indexed could disable the HOT optimization and could have very negative consequences on fragmentation and performance. Do the index recommendations take the effect on HOT optimization into account?

    July 25, 2019 at 6:29 am
    • Avinash Vallarapu Reply

      This logic only checks if the index can decrease the total cost of execution by running an EXPLAIN. If the optimizer picks up the index, it means that the index can improve the query. Before creating that index in reality, you must manually check any other side effects of creating that index. Today, this logic does not check any other aspects except for the one criteria – is it being considered by the optimizer or not.

      July 25, 2019 at 1:47 pm
  • LazyDBA247 Reply

    need to update the function find_usable_indexes() to support columns with case sensitive names

    fixed version:
    CREATE OR REPLACE FUNCTION find_usable_indexes()
    RETURNS VOID AS
    $$
    DECLARE
    l_queries record;
    l_querytext text;
    l_idx_def text;
    l_bef_exp text;
    l_after_exp text;
    hypo_idx record;
    l_attr record;
    /* l_err int; */
    BEGIN
    CREATE TABLE IF NOT EXISTS public.idx_recommendations (queryid bigint,
    query text, current_plan jsonb, recmnded_index text , hypo_plan jsonb);
    FOR l_queries IN
    SELECT t.relid, t.relname, t.queryid, t.attnames, t.attnums,
    pg_qualstats_example_query(t.queryid) as query
    FROM
    (
    SELECT qs.relid::regclass AS relname, qs.relid AS relid, qs.queryid,
    string_agg(DISTINCT quote_ident(attnames.attnames),’,’) AS attnames, qs.attnums
    FROM pg_qualstats_all qs
    JOIN pg_qualstats q ON q.queryid = qs.queryid
    JOIN pg_stat_statements ps ON q.queryid = ps.queryid
    JOIN pg_amop amop ON amop.amopopr = qs.opno
    JOIN pg_am ON amop.amopmethod = pg_am.oid,
    LATERAL
    (
    SELECT pg_attribute.attname AS attnames
    FROM pg_attribute
    JOIN unnest(qs.attnums) a(a) ON a.a = pg_attribute.attnum
    AND pg_attribute.attrelid = qs.relid
    ORDER BY pg_attribute.attnum) attnames,
    LATERAL unnest(qs.attnums) attnum(attnum)
    WHERE NOT
    (
    EXISTS
    (
    SELECT 1
    FROM pg_index i
    WHERE i.indrelid = qs.relid AND
    (arraycontains((i.indkey::integer[])[0:array_length(qs.attnums, 1) – 1],
    qs.attnums::integer[]) OR arraycontains(qs.attnums::integer[],
    (i.indkey::integer[])[0:array_length(i.indkey, 1) + 1]) AND i.indisunique)))
    and qs.queryid not in (-2862297032983545665)
    GROUP BY qs.relid, qs.queryid, qs.qualnodeid, qs.attnums) t
    GROUP BY t.relid, t.relname, t.queryid, t.attnames, t.attnums

    LOOP
    /* RAISE NOTICE ‘% : is queryid’,l_queries.queryid; */
    execute ‘explain (FORMAT JSON) ‘||l_queries.query INTO l_bef_exp;
    execute ‘select hypopg_reset()’;
    execute ‘SELECT indexrelid,indexname FROM hypopg_create_index(”CREATE INDEX on ‘||quote_ident(l_queries.relname::text)||'(‘||l_queries.attnames||’)”)’ INTO hypo_idx;
    execute ‘explain (FORMAT JSON) ‘||l_queries.query INTO l_after_exp;
    execute ‘select hypopg_get_indexdef(‘||hypo_idx.indexrelid||’)’ INTO l_idx_def;
    INSERT INTO public.idx_recommendations (queryid,query,current_plan,recmnded_index,hypo_plan)
    VALUES (l_queries.queryid,l_querytext,l_bef_exp::jsonb,l_idx_def,l_after_exp::jsonb);
    END LOOP;
    execute ‘select hypopg_reset()’;
    END;
    $$ LANGUAGE plpgsql;

    February 9, 2020 at 3:43 am
    • LazyDBA247 Reply

      Please remove: and qs.queryid not in (-2862297032983545665)
      my bad..

      February 9, 2020 at 5:17 am
      • jflambert Reply

        Thanks Lazy DBA. I noticed the function doesn’t work either. Hopefully the author fixes his article.

        March 25, 2020 at 11:02 am
  • Alexander Nikitin Reply

    Hello! I repeated what is described in this article, but could not find the view pg_qualstats_indexes. There are only the following relationships: pg_qualstats, pg_qualstats_pretty, pg_qualstats_all and pg_qualstats_by_query (select relname from pg_class where relname like ‘pg_qualstats%’;)
    Versions: PostgreSQL 11.8, pg_qualstats 2.0.1
    What am i doing wrong?

    May 20, 2020 at 4:44 am

Leave a Reply