Where the open source community meets: Secure your spot for Percona Live Amsterdam! - Register

Downloads

Blog

OpenStack users shed light on Percona XtraDB Cluster deadlock issues

September 11, 2014

Author

Share this Post:

I was fortunate to attend an Ops discussion about databases at the OpenStack Summit Atlanta this past May as one of the panelists. The discussion was about deadlock issues OpenStack operators see with Percona XtraDB Cluster (of course this is applicable to any Galera-based solution). I asked to describe what they are seeing, and as it turned out, nova and neutron uses the SELECT … FOR UPDATE SQL construct quite heavily. This is a topic I thought was worth writing about.

Write set replication in a nutshell (with oversimplification)

Any node is writable, and replication happens in write sets. A write set is practically a row based binary log event or events and “some additional stuff.” The “some additional stuff” is good for 2 things.

- Two write sets can be compared and told if they are conflicting or not.

- A write set can be checked against a database if it’s applicable.

Before committing on the originating node, the write set is transferred to all other nodes in the cluster. The originating node checks that the transaction is not conflicting with any of the transactions in the receive queue and checks if it’s applicable to the database. This process is called certification. After the write set is certified the transaction is committed. The remote nodes will do certification asynchronously compared to the local node. Since the certification is deterministic, they will get the same result. Also the write set on the remote nodes can be applied later because of this reason. This kind of replication is called virtually synchronous, which means that the data transfer is synchronous, but the actual apply is not.

We have a nice flowchat about this.

Since the write set is only transferred before commit, InnoDB row level locks, which are held locally, are not held on remote nodes (if these were escalated, each row lock would take a network round trip to acquire). This also means that by default if multiple nodes are used, the ability to read your own writes is not guaranteed. In that case, a certified transaction, which is already committed on the originating node can still sit in the receive queue of the node the application is reading from, waiting to be applied.

SELECT … FOR UPDATE

The SELECT … FOR UPDATE construct reads the given records in InnoDB, and locks the rows that are read from the index the query used, not only the rows that it returns. Given how write set replication works, the row locks of SELECT … FOR UPDATE are not replicated.

Putting it together

Let’s create a test table.

CREATE TABLE `t` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `ts` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1

CREATE TABLE `t` (

`id` int(11) NOT NULL AUTO_INCREMENT,

`ts` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,

PRIMARY KEY (`id`)

) ENGINE=InnoDB DEFAULT CHARSET=latin1

And some records we can lock.

pxc1> insert into t values();
Query OK, 1 row affected (0.01 sec)

pxc1> insert into t values();
Query OK, 1 row affected (0.01 sec)

pxc1> insert into t values();
Query OK, 1 row affected (0.01 sec)

pxc1> insert into t values();
Query OK, 1 row affected (0.00 sec)

pxc1> insert into t values();
Query OK, 1 row affected (0.01 sec)

pxc1> insert into t values();

Query OK, 1 row affected (0.01 sec)

pxc1> insert into t values();

Query OK, 1 row affected (0.01 sec)

pxc1> insert into t values();

Query OK, 1 row affected (0.01 sec)

pxc1> insert into t values();

Query OK, 1 row affected (0.00 sec)

pxc1> insert into t values();

Query OK, 1 row affected (0.01 sec)

pxc1> select * from t;
+----+---------------------+
| id | ts                  |
+----+---------------------+
|  1 | 2014-06-26 21:37:01 |
|  4 | 2014-06-26 21:37:02 |
|  7 | 2014-06-26 21:37:02 |
| 10 | 2014-06-26 21:37:03 |
| 13 | 2014-06-26 21:37:03 |
+----+---------------------+
5 rows in set (0.00 sec)

pxc1> select * from t;

+----+---------------------+

| id | ts |

+----+---------------------+

| 1 | 2014-06-26 21:37:01 |

| 4 | 2014-06-26 21:37:02 |

| 7 | 2014-06-26 21:37:02 |

| 10 | 2014-06-26 21:37:03 |

| 13 | 2014-06-26 21:37:03 |

+----+---------------------+

5 rows in set (0.00 sec)

On the first node, lock the record.

pxc1> start transaction;
Query OK, 0 rows affected (0.00 sec)

pxc1> select * from t where id=1 for update;
+----+---------------------+
| id | ts                  |
+----+---------------------+
|  1 | 2014-06-26 21:37:01 |
+----+---------------------+
1 row in set (0.00 sec)

pxc1> start transaction;

Query OK, 0 rows affected (0.00 sec)

pxc1> select * from t where id=1 for update;

+----+---------------------+

| id | ts |

+----+---------------------+

| 1 | 2014-06-26 21:37:01 |

+----+---------------------+

1 row in set (0.00 sec)

On the second, update it with an autocommit transaction.

pxc2> update t set ts=now() where id=1;
Query OK, 1 row affected (0.01 sec)
Rows matched: 1  Changed: 1  Warnings: 0

pxc1> select * from t;
ERROR 1213 (40001): Deadlock found when trying to get lock; try restarting transaction

pxc2> update t set ts=now() where id=1;

Query OK, 1 row affected (0.01 sec)

Rows matched: 1 Changed: 1 Warnings: 0

pxc1> select * from t;

ERROR 1213 (40001): Deadlock found when trying to get lock; try restarting transaction

Let’s examine what happened here. The local record lock held by the started transation on pxc1 didn’t play any part in replication or certification (replication happens at commit time, there was no commit there yet). Once the node received the write set from pxc2, that write set had a conflict with a transaction still in-flight locally. In this case, our transaction on pxc1 has to be rolled back. This is a type of conflict as well, but here the conflict is not caught on certification time. This is called a brute force abort. This happens when a transaction done by a slave thread conflict with a transaction that’s in-flight on the node. In this case the first commit wins (which is the already replicated one) and the original transaction is aborted. Jay Janssen discusses multi-node writing conflicts in detail in this post.

The same thing happens when 2 of the nodes are holding record locks via select for update. Whichever node commits first will win, the other transaction will hit the deadlock error and will be rolled back. The behavior is correct.

Here is the same SELECT … FOR UPDATE transaction overlapping on the 2 nodes.

pxc1> start transaction;
Query OK, 0 rows affected (0.00 sec)

pxc2> start transaction;
Query OK, 0 rows affected (0.00 sec)

pxc1> start transaction;

Query OK, 0 rows affected (0.00 sec)

pxc2> start transaction;

Query OK, 0 rows affected (0.00 sec)

pxc1> select * from t where id=1 for update;
+----+---------------------+
| id | ts                  |
+----+---------------------+
|  1 | 2014-06-26 21:37:48 |
+----+---------------------+
1 row in set (0.00 sec)

pxc2> select * from t where id=1 for update;
+----+---------------------+
| id | ts                  |
+----+---------------------+
|  1 | 2014-06-26 21:37:48 |
+----+---------------------+
1 row in set (0.00 sec)

pxc1> select * from t where id=1 for update;

+----+---------------------+

| id | ts |

+----+---------------------+

| 1 | 2014-06-26 21:37:48 |

+----+---------------------+

1 row in set (0.00 sec)

pxc2> select * from t where id=1 for update;

+----+---------------------+

| id | ts |

+----+---------------------+

| 1 | 2014-06-26 21:37:48 |

+----+---------------------+

1 row in set (0.00 sec)

pxc1> update t set ts=now() where id=1;
Query OK, 1 row affected (0.01 sec)
Rows matched: 1  Changed: 1  Warnings: 0

pxc2> update t set ts=now() where id=1;
Query OK, 1 row affected (0.00 sec)
Rows matched: 1  Changed: 1  Warnings: 0

pxc1> update t set ts=now() where id=1;

Query OK, 1 row affected (0.01 sec)

Rows matched: 1 Changed: 1 Warnings: 0

pxc2> update t set ts=now() where id=1;

Query OK, 1 row affected (0.00 sec)

Rows matched: 1 Changed: 1 Warnings: 0

pxc1> commit;
Query OK, 0 rows affected (0.00 sec)

pxc2> commit;
ERROR 1213 (40001): Deadlock found when trying to get lock; try restarting transaction

pxc1> commit;

Query OK, 0 rows affected (0.00 sec)

pxc2> commit;

ERROR 1213 (40001): Deadlock found when trying to get lock; try restarting transaction

Where does this happen in OpenStack?

For example in OpenStack Nova (the compute project in OpenStack), tracking the quota usage uses the SELECT…FOR UPDATE construct.

# User@Host: nova[nova] @  [10.10.10.11]  Id:   147
# Schema: nova  Last_errno: 0  Killed: 0
# Query_time: 0.001712  Lock_time: 0.000000  Rows_sent: 4  Rows_examined: 4  Rows_affected: 0
# Bytes_sent: 1461  Tmp_tables: 0  Tmp_disk_tables: 0  Tmp_table_sizes: 0
# InnoDB_trx_id: C698
# QC_Hit: No  Full_scan: Yes  Full_join: No  Tmp_table: No  Tmp_table_on_disk: No
# Filesort: No  Filesort_on_disk: No  Merge_passes: 0
#   InnoDB_IO_r_ops: 0  InnoDB_IO_r_bytes: 0  InnoDB_IO_r_wait: 0.000000
#   InnoDB_rec_lock_wait: 0.000000  InnoDB_queue_wait: 0.000000
#   InnoDB_pages_distinct: 2
SET timestamp=1409074305;
SELECT quota_usages.created_at AS quota_usages_created_at, quota_usages.updated_at AS quota_usages_updated_at, quota_usages.deleted_at AS quota_usages_deleted_at, quota_usages.deleted AS quota_usages_deleted, quota_usages.id AS quota_usages_id, quota_usages.project_id AS quota_usages_project_id, quota_usages.user_id AS quota_usages_user_id, quota_usages.resource AS quota_usages_resource, quota_usages.in_use AS quota_usages_in_use, quota_usages.reserved AS quota_usages_reserved, quota_usages.until_refresh AS quota_usages_until_refresh
FROM quota_usages
WHERE quota_usages.deleted = 0 AND quota_usages.project_id = '12ce401aa7e14446a9f0c996240fd8cb' FOR UPDATE;

# User@Host: nova[nova] @ [10.10.10.11] Id: 147

# Schema: nova Last_errno: 0 Killed: 0

# Query_time: 0.001712 Lock_time: 0.000000 Rows_sent: 4 Rows_examined: 4 Rows_affected: 0

# Bytes_sent: 1461 Tmp_tables: 0 Tmp_disk_tables: 0 Tmp_table_sizes: 0

# InnoDB_trx_id: C698

# QC_Hit: No Full_scan: Yes Full_join: No Tmp_table: No Tmp_table_on_disk: No

# Filesort: No Filesort_on_disk: No Merge_passes: 0

# InnoDB_IO_r_ops: 0 InnoDB_IO_r_bytes: 0 InnoDB_IO_r_wait: 0.000000

# InnoDB_rec_lock_wait: 0.000000 InnoDB_queue_wait: 0.000000

# InnoDB_pages_distinct: 2

SET timestamp=1409074305;

SELECT quota_usages.created_at AS quota_usages_created_at, quota_usages.updated_at AS quota_usages_updated_at, quota_usages.deleted_at AS quota_usages_deleted_at, quota_usages.deleted AS quota_usages_deleted, quota_usages.id AS quota_usages_id, quota_usages.project_id AS quota_usages_project_id, quota_usages.user_id AS quota_usages_user_id, quota_usages.resource AS quota_usages_resource, quota_usages.in_use AS quota_usages_in_use, quota_usages.reserved AS quota_usages_reserved, quota_usages.until_refresh AS quota_usages_until_refresh

FROM quota_usages

WHERE quota_usages.deleted = 0 AND quota_usages.project_id = '12ce401aa7e14446a9f0c996240fd8cb' FOR UPDATE;

So where does it come from?

These constructs are generated by SQLAlchemy using with_lockmode(‘update’). Even in nova’s pydoc, it’s recommended to avoid with_lockmode(‘update’) whenever possible. Galera replication is not mentioned among the reasons to avoid this construct, but knowing how many OpenStack deployments are using Galera for high availability (either Percona XtraDB Cluster, MariaDB Galera Cluster, or Codership’s own mysql-wsrep), it can be a very good reason to avoid it. The solution proposed in the linked pydoc above is also a good one, using an INSERT INTO … ON DUPLICATE KEY UPDATE is a single atomic write, which will be replicated as expected, it will also keep correct track of quota usage.

The simplest way to overcome this issue from the operator’s point of view is to use only one writer node for these types of transactions. This usually involves configuration change at the load-balancer level. See this post for possible load-balancer configurations.