Proposal for Global Indexes in PostgreSQL

PostgreSQLA global index, by very definition, is a single index on the parent table that maps to many underlying table partitions. The parent table itself does not have a single, unified underlying store so it must, therefore, retrieve the data satisfying index constraints from physically distributed tables. In very crude terms, the global index accumulates data in one place so that data spanning across multiple partitions are accessed in one go as opposed to individually querying each partition.

Currently, there is no Global Index implementation available in PostgreSQL, and therefore I want to propose a new feature.  I have sent a proposal to the community, and that discussion is now started. In this proposal, I ask for Global Index support just for B-Tree and will consider other index methods later.

Terminologies used

  • Global Indexes

 A one-to-many index, in which one index map to all the partitioned tables. 

  • Partitioned Index (Index Partitioning)

When global indexes become too large, then those are partitioned to keep the performance and maintenance overhead manageable. These are not within the scope of this work.

  • Local Index

A local index is an index that is local to a specific table partition; i.e. it doesn’t span across multiple partitions. So, when we create an index on a parent table, it will create a separate index for all its partitions. PostgreSQL uses the terminology of “partitioned index” when it refers to local indexes. This work will fix this terminology for PostgreSQL so that the nomenclature remains consistent with other DBMS.

Why Do We Need Global Index in PostgreSQL?

A global index is expected to give two very important upgrades to the partitioning feature set in PostgreSQL. It is expected to give a significant improvement in read-performance for queries targeting multiple local indexes of partitions, as well as adding a unique constraint across partitions.

Unique Constraint

Data uniqueness is a critical requirement for building an index. For global indexes that span across multiple partitions, uniqueness will have to be enforced on index column(s). This effectively translates into a unique constraint.

Performance

Currently, the pseudo index created on the parent table of partitions does not contain any data. Rather, it dereferences to the local indexes when an index search is required. This means that multiple indexes will have to be evaluated with data to be combined thereafter. However, with the global indexes, data will reside with the global index declared on the parent table. This avoids the need for multi-level index lookups, so read performance is expected to be significantly higher in some cases. There will, however, be a negative performance impact during write (insert/update) of data. This is discussed in more detail later on.

Creating a Global Index – Syntax

A global index may be created with the addition of a “GLOBAL” keyword to the index statement. Alternatively, one could specify the “LOCAL” keyword to create local indexes on partitions. We are suggesting to call this set of keywords: “partition_index_type”. By default, partition_index_type will be set as LOCAL. Here is a sample of the create index syntax.

Pointing Index to Tuple

Currently, CTID carries a page and offset information for a known heap (table name). However, in the context of global indexes, this information within an index is insufficient. Since the index is expected to carry tuples from multiple partitions (heaps), CTID alone will not be able to link an index node to a tuple. This requires carrying additional data for the heap name to be stored with each index node.

Optimizer

The challenge with optimizer is a selection between local and global indexes when both are present. There have been many open questions, including evaluating the cost of scanning a global index. When should the LOCAL index be preferred over the GLOBAL index and vice versa?

Write Performance and Vacuum

There will be some write performance degradation because every change in partition tables must propagate upwards to the GLOBAL index on the parent table. This can be thought of as another index on a table, however, the [slight] performance degradation will be due to the fact that the GLOBAL index may carry a much bigger dataset with data from multiple partitions resulting in a higher tree traversal and update time. This applies to both write and vacuum processes.

It is still an open question, though, on how this will be handled within the code and how we can better optimize this process.

Conclusion

As we know most major DBMS engines that have partitioning support also have support for the Global Index. PostgreSQL has very powerful partitioning support but lacks the support of the Global Index. Global Index not only ensures the uniqueness across partitioning but also improves read performance.  I have sent the proposal to PostgreSQL Community and while a discussion has been started, it is a slow process. If you are an engineer and want to contribute, respond to that thread in the community. If you are a user and have some uses cases, please share that on the same mail chain. 

Discuss on Hacker News

Share this post

Comment (1)

  • Alexandre hadjinlian guerra Reply

    Its helpful to have it specially for sequences and IDs… Some designs overcome this limitation, but takes additional effort to deal with. Keep this gem close

    November 20, 2019 at 2:10 pm

Leave a Reply