Basic Understanding of Bloat and VACUUM in PostgreSQL


PostgreSQL's implementation of MVCC (Multi-Version Concurrency Control) is different and special when compared with other RDBMSs. MVCC in PostgreSQL controls which tuples are visible to transactions via versioning.

What is versioning in PostgreSQL?

Let’s consider the case of an Oracle or a MySQL database. What happens when you perform a DELETE or an UPDATE of a row? You see an UNDO record maintained in a global UNDO segment. This UNDO segment contains the past image of the row, to help the database achieve consistency (the “C” in A.C.I.D.). For example, if an old transaction depends on the row that got deleted, the row may still be visible to it because the past image is still maintained in the UNDO. If you are an Oracle DBA reading this blog post, you may quickly recollect the error ORA-01555: snapshot too old. This error means your undo_retention setting was too small, or your UNDO segment was not large enough to retain all the past images (versions) needed by the existing or old transactions.

You may not have to worry about that with PostgreSQL.

Then how does PostgreSQL manage UNDO?

In simple terms, PostgreSQL maintains both the past image and the latest image of a row within the table itself. In other words, UNDO is maintained within each table, and this is done through versioning. This hints that every row of a PostgreSQL table has a version number, and that is absolutely correct. To understand how these versions are maintained within each table, you should understand the hidden columns of a table (especially xmin) in PostgreSQL.

Understanding the Hidden Columns of a Table

When you describe a table, you only see the columns you have added, as in the following log.

However, if you look at all the columns of the table in pg_attribute, you should see several hidden columns as you see in the following log.
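As a sketch (the scott.employee table used in the later examples stands in here; any table works), the hidden columns can be listed like this:

```sql
-- \d scott.employee in psql shows only the user-defined columns.
-- pg_attribute, however, also lists the hidden (system) columns,
-- which carry a negative attnum:
SELECT attname, format_type(atttypid, atttypmod), attnum
FROM pg_attribute
WHERE attrelid = 'scott.employee'::regclass
ORDER BY attnum;
-- tableoid, cmax, xmax, cmin, xmin and ctid appear alongside
-- the user-defined columns
```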

Let’s understand a few of these hidden columns in detail.

tableoid : Contains the OID of the table that contains this row. Used by queries that select from inheritance hierarchies.
More details on table inheritance can be found here : https://www.postgresql.org/docs/10/static/ddl-inherit.html

xmin : The transaction ID (xid) of the inserting transaction for this row version. Upon update, a new row version is inserted. Let’s see the following log to understand xmin better.
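A session along these lines reproduces that log; the exact transaction IDs will differ on your system, and this scott.employee definition is illustrative:

```sql
CREATE SCHEMA IF NOT EXISTS scott;
CREATE TABLE scott.employee (id int, name varchar(20), dept varchar(20));

SELECT txid_current();                                    -- say this returns 646
INSERT INTO scott.employee VALUES (1, 'avi', 'SUPPORT');  -- runs as txid 647

SELECT xmin, xmax, id, name FROM scott.employee;
-- xmin shows 647, the txid of the inserting transaction
```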

As you see in the above log, the transaction ID was 646 for the command select txid_current(). Thus, the immediately following INSERT statement got the transaction ID 647, and the record was assigned an xmin of 647. This means no transaction that started before ID 647 can see this row; in other words, already running transactions with a txid less than 647 cannot see the row inserted by txid 647.

With the above example, you should now understand that every tuple has an xmin that is assigned the txid that inserted it.

Note: this behavior may change depending on the isolation level you choose; we will discuss that in another blog post.

xmax : This value is 0 if the row version has not been deleted. Before the DELETE is committed, the xmax of the row version changes to the ID of the transaction that issued the DELETE. Let’s observe the following log to understand that better.

On Terminal A : We open a transaction and delete a row without committing it.

On Terminal B : Observe the xmax values before and after the delete (that has not been committed).
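A sketch of the two sessions (row values and transaction IDs are illustrative):

```sql
-- Terminal A: delete a row inside an open transaction
BEGIN;
DELETE FROM scott.employee WHERE id = 1;
-- no COMMIT yet

-- Terminal B: the row is still visible here, but its xmax is no
-- longer 0; it now holds the txid of the uncommitted DELETE
SELECT xmin, xmax, id, name FROM scott.employee WHERE id = 1;
```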

As you see in the above logs, the xmax value changed to the ID of the transaction that issued the DELETE. If you issue a ROLLBACK, or if the transaction gets aborted, xmax remains set to the transaction ID that tried to DELETE the row (655 in this case).

Now that we understand the hidden columns xmin and xmax, let’s observe what happens after a DELETE or an UPDATE in PostgreSQL. As the hidden columns showed us, multiple versions of rows are maintained within each table. Let’s see the following example to understand this better.

We’ll insert 10 records into the table scott.employee.

Now, let’s DELETE 5 records from the table.

Now, when you check the count after the DELETE, you do not see the records that have been deleted. To see row versions that exist in the table but are not visible, we have an extension called pageinspect. The pageinspect module provides functions that allow you to inspect the contents of database pages at a low level, which is useful for debugging purposes. Let’s create this extension to see the older row versions that have been deleted.
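The whole sequence can be sketched as follows (t_xmin/t_xmax values will differ; the table definition is illustrative):

```sql
CREATE SCHEMA IF NOT EXISTS scott;
CREATE TABLE scott.employee (id int, name varchar(20), dept varchar(20));
INSERT INTO scott.employee
SELECT g, 'emp' || g, 'SUPPORT' FROM generate_series(1, 10) g;

DELETE FROM scott.employee WHERE id > 5;
SELECT count(*) FROM scott.employee;   -- 5: the deleted rows are invisible

CREATE EXTENSION pageinspect;
SELECT t_xmin, t_xmax, t_ctid
FROM heap_page_items(get_raw_page('scott.employee', 0));
-- all 10 row versions are still on the page; the 5 deleted ones
-- carry a non-zero t_xmax (the txid that deleted them)
```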

Here we can still see 10 records in the table even after deleting 5 of them. You can also observe that t_xmax is set to the transaction ID that deleted them. These deleted records are retained in the same table to serve any older transactions that are still accessing them.

We’ll take a look at what an UPDATE does in the following log.
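A sketch of such a log, continuing with the illustrative scott.employee table:

```sql
UPDATE scott.employee SET dept = 'SALES';

SELECT t_xmin, t_xmax, t_ctid
FROM heap_page_items(get_raw_page('scott.employee', 0));
-- each pre-update row version now has a non-zero t_xmax, and a
-- fresh version of every updated row has been appended
```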

An UPDATE in PostgreSQL performs an insert and a delete. Hence, all the records being updated have been deleted and inserted back with the new value. Deleted records have a non-zero t_xmax value.

Records for which you see a non-zero value for t_xmax may be required by the previous transactions to ensure consistency based on appropriate isolation levels.

We have discussed xmin and xmax. What about the hidden columns cmin and cmax?

cmax : The command identifier within the deleting transaction, or zero (as per the documentation). However, cmin and cmax always show the same value, as per the PostgreSQL source code.

cmin : The command identifier within the inserting transaction. You can see the cmin of the 3 INSERT statements starting at 0 in the following log.

See the following log to understand how the cmin and cmax values change through inserts and deletes in a transaction.
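Sketched with the illustrative scott.employee table (cmin/cmax are per-transaction command counters, so they restart for each new transaction):

```sql
BEGIN;
INSERT INTO scott.employee VALUES (11, 'emp11', 'SUPPORT');
INSERT INTO scott.employee VALUES (12, 'emp12', 'SUPPORT');
INSERT INTO scott.employee VALUES (13, 'emp13', 'SUPPORT');

SELECT cmin, cmax, id FROM scott.employee WHERE id > 10;
-- cmin (and cmax) read 0, 1 and 2: the position of each INSERT
-- within this transaction
COMMIT;
```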

If you observe the above output log, you see cmin and cmax values incrementing for each insert.

Now let’s delete 3 records from Terminal A and observe how the values appear in Terminal B before COMMIT.

Now, in the above log, you see that the cmin and cmax values of the records being deleted have again started incrementally from 0. Their values were different before the DELETE, as we saw earlier. Even if you ROLLBACK, the values remain the same.

After understanding the hidden columns and how PostgreSQL maintains UNDO as multiple versions of rows, the next question would be—what would clean up this UNDO from a table? Doesn’t this increase the size of a table continuously? In order to understand that better, we need to know about VACUUM in PostgreSQL.

VACUUM in PostgreSQL

As seen in the above examples, every record that has been deleted but still occupies space is called a dead tuple. Once no running transactions depend on those dead tuples, they are no longer needed. PostgreSQL therefore runs VACUUM on such tables. VACUUM reclaims the storage occupied by these dead tuples; the space occupied by dead tuples may be referred to as bloat. VACUUM scans the pages for dead tuples and records the freed space in the free space map (FSM). Each relation, apart from hash indexes, has an FSM stored in a separate file called <relation_oid>_fsm.

Here, relation_oid is the oid of the relation that is visible in pg_class.

Upon VACUUM, this space is not reclaimed to the filesystem but can be reused by future inserts into this table. VACUUM stores the free space available on each heap (or index) page in the FSM file.
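The free space recorded in the FSM can be inspected with the pg_freespacemap contrib module, for example:

```sql
CREATE EXTENSION pg_freespacemap;

VACUUM scott.employee;
SELECT * FROM pg_freespace('scott.employee');
-- one row per heap page (blkno, avail): the free bytes that
-- VACUUM recorded for each page
```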

Running a plain VACUUM is a non-blocking operation: it never takes an ACCESS EXCLUSIVE lock on the table. This means VACUUM can run on a busy transactional table in production while several transactions are writing to it.

As we discussed earlier, an UPDATE of 10 records generated 10 dead tuples. Let’s see in the following log what happens to those dead tuples after a VACUUM.
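Sketched, continuing with the illustrative scott.employee table:

```sql
VACUUM scott.employee;

SELECT t_xmin, t_xmax, t_ctid
FROM heap_page_items(get_raw_page('scott.employee', 0));
-- the item slots that held dead tuples are now empty; the space is
-- reusable by future inserts, but the file on disk is not shrunk
```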

In the above log, you can see that the dead tuples are removed and the space is available for reuse. However, this space is not reclaimed to the filesystem; only future inserts can use it.

VACUUM performs an additional task. All rows that were inserted and successfully committed in the past are marked as frozen, which indicates that they are visible to all current and future transactions. We will discuss this in detail in a future blog post, “Transaction ID Wraparound in PostgreSQL”.

VACUUM does not usually reclaim space to the filesystem unless the dead tuples are beyond the high water mark, that is, on the trailing pages of the table.

Let’s consider the following example to see when a VACUUM could release the space to filesystem.

Create a table and insert some sample records. The records are inserted in primary-key order, so they are physically ordered on disk accordingly.

Now, run ANALYZE on the table to update its statistics and see how many pages are allocated to the table after the above insert.
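A sketch of the setup (the table definition is illustrative; exact page counts depend on row width and block_size):

```sql
CREATE SCHEMA IF NOT EXISTS scott;
CREATE TABLE scott.employee (emp_id int PRIMARY KEY, name varchar(20));
INSERT INTO scott.employee
SELECT g, 'emp' || g FROM generate_series(1, 1000) g;

ANALYZE scott.employee;
SELECT relpages FROM pg_class WHERE oid = 'scott.employee'::regclass;
-- 6 pages in this example, with the default 8 KB block_size
```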

Let’s now see how VACUUM behaves when you delete the rows with emp_id > 500
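Sketched, assuming the 1,000-row scott.employee table set up above:

```sql
DELETE FROM scott.employee WHERE emp_id > 500;
VACUUM scott.employee;

ANALYZE scott.employee;
SELECT relpages FROM pg_class WHERE oid = 'scott.employee'::regclass;
-- roughly half the pages remain: the trailing pages held only dead
-- tuples, so VACUUM truncated them and returned the space to the
-- filesystem
```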

In the above log, you see that VACUUM has reclaimed half the space to the filesystem. Earlier, the table occupied 6 pages (of 8 KB each, or whatever the block_size parameter is set to). After VACUUM, 3 pages have been released to the filesystem.

Now, let’s repeat the same exercise by deleting the rows with emp_id < 500
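Sketched, again assuming a fresh 1,000-row scott.employee table:

```sql
DELETE FROM scott.employee WHERE emp_id < 500;
VACUUM scott.employee;

ANALYZE scott.employee;
SELECT relpages FROM pg_class WHERE oid = 'scott.employee'::regclass;
-- the page count is unchanged: the surviving rows live on the
-- trailing pages, so VACUUM has nothing to truncate
```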

In this example, you see that the number of pages remains the same after deleting half the records from the table. This means VACUUM did not release space to the filesystem this time.

As explained earlier, VACUUM can truncate a table and return space to the filesystem only when the trailing pages contain no live tuples. In the first case, there were no live tuples after the 3rd page, so the 4th, 5th, and 6th pages were truncated and returned to the filesystem.

However, if you need to reclaim the space to the filesystem in the scenario where we deleted all the records with emp_id < 500, you can run VACUUM FULL. VACUUM FULL rebuilds the entire table and reclaims the space to disk.

Please note that VACUUM FULL is not an online operation: it is a blocking operation. You cannot read from or write to the table while VACUUM FULL is in progress. We will discuss ways to rebuild a table online without blocking in a future blog post.


Comments (7)

  • Tiago Corcelli Oliveira

    Very very good explain.

    August 8, 2018 at 10:01 am
  • raghavendra

    Very nice explanation. Where can I find the ways to rebuild a table online without blocking .

    August 12, 2018 at 7:56 am
    • avivallarapu

      Thank You Raghavendra. You can rebuild a table online using pg_repack. We would be submitting a blog post on it soon and then add a comment with the link.

      August 13, 2018 at 11:18 am
  • Peter Zaitsev

    Hi Avi,

    This is a good explanation which related to the data. What is about the bloat in the indexes, which I assume also can contain old pointers. How does this play in the picture ?

    August 12, 2018 at 8:08 pm
    • avivallarapu

      Yes, autovacuum/vacuum does take care of Indexes

      We have a hidden column called ctid, which is the physical location of the row version within its table. Both the table and its indexes carry the same matching ctid. It may be used as a row identifier, and it changes upon an UPDATE or a table rebuild.

      percona=# CREATE TABLE percona (id int, name varchar(20));
      CREATE TABLE
      percona=# CREATE INDEX percona_id_index ON percona (id);
      CREATE INDEX
      percona=# INSERT INTO percona VALUES (1,'avinash'),(2,'vallarapu'),(3,'avi');
      INSERT 0 3
      percona=# SELECT id, name, ctid from percona;
       id |   name    | ctid
      ----+-----------+-------
        1 | avinash   | (0,1)
        2 | vallarapu | (0,2)
        3 | avi       | (0,3)
      (3 rows)

      percona=# DELETE from percona where id < 3;
      DELETE 2

      After deleting the records, let us see the items inside table/index pages

      Table
      =======
      percona=# SELECT t_xmin, t_xmax, tuple_data_split('percona'::regclass, t_data, t_infomask, t_infomask2, t_bits) FROM heap_page_items(get_raw_page('percona', 0));
       t_xmin | t_xmax |             tuple_data_split
      --------+--------+-------------------------------------------
         3825 |   3826 | {"\\x01000000","\\x116176696e617368"}
         3825 |   3826 | {"\\x02000000","\\x1576616c6c6172617075"}
         3825 |      0 | {"\\x03000000","\\x09617669"}
      (3 rows)

      Index
      =======
      percona=# SELECT * FROM bt_page_items('percona_id_index', 1);
       itemoffset | ctid  | itemlen | nulls | vars |          data
      ------------+-------+---------+-------+------+-------------------------
                1 | (0,1) |      16 | f     | f    | 01 00 00 00 00 00 00 00
                2 | (0,2) |      16 | f     | f    | 02 00 00 00 00 00 00 00
                3 | (0,3) |      16 | f     | f    | 03 00 00 00 00 00 00 00
      (3 rows)

      After running VACUUM, corresponding pointers with same ctid are also removed from Index through a RowExclusiveLock.

      percona=# VACUUM ANALYZE percona;
      VACUUM
      percona=# SELECT t_xmin, t_xmax, tuple_data_split('percona'::regclass, t_data, t_infomask, t_infomask2, t_bits) FROM heap_page_items(get_raw_page('percona', 0));
       t_xmin | t_xmax |       tuple_data_split
      --------+--------+-------------------------------
              |        |
              |        |
         3825 |      0 | {"\\x03000000","\\x09617669"}
      (3 rows)

      percona=# SELECT * FROM bt_page_items('percona_id_index', 1);
       itemoffset | ctid  | itemlen | nulls | vars |          data
      ------------+-------+---------+-------+------+-------------------------
                1 | (0,3) |      16 | f     | f    | 03 00 00 00 00 00 00 00
      (1 row)

      August 13, 2018 at 11:08 am
  • vikramreddyrec1234

    Hello Avi,
    its good explanation. as you mention “VACUUM does not usually reclaim the space to filesystem unless the dead tuples are beyond the high water mark.”

    –> is there a query to check dead tuples are beyond the high water mark or not?

    August 14, 2018 at 1:18 am
  • Gascard

    Hello avinash,
    Thank you for the explanation, I will follow you . Thierry

    August 14, 2018 at 1:56 pm
