Following on from our earlier announcement, Paul McCullagh has responded with the answers to your questions – as well as a few I gathered from other Percona folks, and attendees of OpenSQL Camp. Thank you Paul!
Unfortunately it is not possible to point to a specific category of applications and say, “PBXT will be better here, so try it”.Â PBXT is a general purpose transactional storage engine, designed to perform well on a broad range of tasks, much like InnoDB.Â However, PBXT’s log-based architecture makes performance characteristics different to both MyISAM and InnoDB/XtraDB. Tests show that PBXT’s performance is similar to InnoDB but, depending on your database designed and the application, it can be faster.
PBXT is a community project and, of course, we depend on users trying it out. In the long run, this will determine to what extent we are able to continue to develop and improve the engine.Â So, despite this rather vague answer, we are hoping that more people try it out, and work with us to improve the engine as necessary.Â My thanks to all who are already doing this!
This is no longer necessarily the case. For example a test (http://mysqlha.blogspot.com/2009/03/pbxt-is-fast-no-kidding.html) by Mark Callaghan shows that PBXT can actually out perform InnoDB with SELECTs under circumstances.
The implementation of full-durability has changed the performance characteristics of PBXT from “MyISAM-like” to more InnoDB-like. Originally PBXT was conceived as an engine that would be somewhere between MyISAM and InnoDB in both performance and features. The early version of PBXT was not fully durable (equivalent to innodb_flush_log_at_trx_commit=2).
A major change was completed at the beginning of last year with the implementation of full-durability. In doing this it was important to keep the log-based architecture which was the reason for the high write performance of earlier versions.
Traditional transactional implementations suffer from the problem that a backlog of asynchronous writes accumulate until it swamps the engine. There has been a lot of work on both InnoDB and XtraDB to solve this problem. The key words here are fuzzy and adaptive checkpointing (the former, originally implementation by Heiki for InnoDB, and the latter, an excellent addition to XtraDB).
Both methods improve the management of the asynchronous writes. The idea behind the log-based solution, on the other hand, is to avoid the accumulating a backlog of asynchronous writes, but writing synchronously.
Although write performance is comparable with InnoDB, I am not entirely convinced that PBXT’s implementation of the log-based I/O is optimal at this stage. This is ongoing work for PBXT 1.5.
Morgan notes: As well as Adaptive Checkpointing, Oracle has also been working on Adaptive Flushing for the InnoDB Plugin. The engine being ‘swamped’ problem that Paul is referring to is best described visually – see this post for more info.
If you read the white paper from 2006 (http://primebase.org/download/pbxt_white_paper.pdf) you will notice that the original design was uncompromisingly MVCC-based.Â Some of this has been changed to make PBXT more InnoDB-like, but other principles have remained.
Pure-MVCC does not do any locking. Read locks are not required because each transaction effectively gets its own snapshot of the database. And write locks are not acquired when updating. Instead, the application can hit a “optimistic lock error” if a record is updated by another user.
Now PBXT does acquire locks for 2 reasons: to support SELECT FOR UPDATE, and to avoid optimistic locking errors. This makes PBXT’s behavior identical to InnoDB in REPEATABLE READ mode.
On the other hand, there are currently no plans to implement InnoDB style “gap locking”. Gap locking effectively involves locking rows that do not exist. This, in turn, means that PBXT transactions are not SERIALIZABLE.Â A result of this is that statement-based replication is not supported by the engine.
Another hard decision was not to implement clustered indexes which I mentioned in more details later.
A recent version of PBXT (1.0.09) supports the MySQL Backup API which was originally implemented in MySQL 6.0. This feature is now scheduled for an upcoming version of MySQL 5.4.
The Backup API makes it possible to pull a consistent snapshot of an entire database even when tables use different engine types. The API does not yet support incremental backup, but this is planned.
Internally this feature is implemented by PBXT using an MVCC-based consistent snapshot.
PBXT has several system threads, that are responsible for various maintenance tasks. The most important of these are the “Writer”, the “Sweeper” and the “Checkpointer”;
No, currently it does not. This is one of the original design decisions (as raised by a previous question).Â Two things contributed to this decision:
PBXT uses 16K pages for the index data and (approximately) 32K pages for the table data. Both sizes can be set using compile time switches. However, if the index page size is changed, then the indices need to be rebuilt, which can be done by REPAIR TABLE.Â The table data page size does not require a rebuild because a page of records in the table is just a group of records (not an actual fixed length page).
If you are using InnoDB in REPEATABLE READ mode, then there is essentially no difference in the isolation paradigm between the two engines.
REPEATABLE READ is often preferred over SERIALIZABLE mode because it allows a greater degree of concurrency while still providing the necessary transaction isolation. So I do not consider the lack of serializability as a serious deficit in the engine. And, fortunately MySQL 5.1. supports row-based replication which makes it possible to do replication while using REPEATABLE READ.
PBXT does use MVCC to do index scans. Basically this means that all types of SELECTs can be done without locking.
Morgan notes: Indexes not using MVCC is one of the main differences in the Falcon storage engine.
Firstly, PBXT does not acquire read locks. A normal SELECT does not lock at all. In addition, an UPDATE or DELETE only acquires a temporary row-lock. This lock is released when the row is updated or deleted because the MVCC system can detect that a row has been changed (and is therefore write locked).
This means that PBXT does not normally need to maintain long lists of row-level locks. This is also the case when a foreign key leads to cascading operations which can affect thousands of rows.
The only case you need to be aware of is SELECT FOR UPDATE. This operation acquires and holds a row-level lock for each row returned by the SELECT. These locks are all stored in RAM. The format is quite compact (especially when row IDs are consecutive) but this can become an issue if millions of rows are selected in this manner.
I should also mention that the consequence of this is that SELECT … LOCK IN SHARE MODE is currently not supported.
Yes, I think you have mentioned the most important criteria. What I can add to this list are 3 things that make developing a storage engine extremely demanding: performance, stability and data integrity.
Of course, as a DBA or database user these aspects are so basic that they are taken for granted.
But engine developers need to keep performance, stability and data integrity in mind constantly. The problem is, they compete with each other: increasing performance often causes instabilities that then have to be fixed. How to optimize the program without compromising data integrity is a constant question.
Relative to maintaining performance, stability and data integrity, adding features to an engine is easy. So I would say that these are the criteria that concern a developer the most.
Unfortunately the “war” continues. I have already received several e-mails that PBXT does not compile with the recently released MySQL 5.1.41!
Any dot release can lead to this problem, and I think PBXT is fairly moderate with its integration into MySQL.
My main advantage: I have been able to avoid modifying any part of MySQL to make the engine work. This means that PBXT runs with the standard MySQL/MariaDB distribution.
But this has required quite a bit of creative work, in other words, hacks.
One of the main problems has been running into global locks when calling back into MySQL to do things like open a table, create a session structure (THD) or create a .frm file.
One extreme example of this is PBXT recovery. When MySQL calls the engine “init” method on startup it is holding the global LOCK_plugin lock. In init, the engine needs to do recovery. In PBXT’s case this means opening tables (reading a .frm file), which requires creating a THD. The code to create a THD in turn tries to acquire LOCK_plugin!
Unfortunately a thread hangs if it tries to acquire the same mutex twice, so this just does not work!
We went through quite a few iterations (MySQL code was also changing during the development of 5.1) before we came up with the current solution: create a background thread to do recovery asynchronously. So the thread can wait for the LOCK_plugin to be unlocked before it continues.
The affect is that the init function returns quickly, but the first queries that access PBXT tables may hang waiting for recovery to complete.
No, this is not by design.
While I try to only add tuning parameters that are absolutely necessary, PBXT is not specifically designed to be self-tuning, because I believe that is a very hard problem to solve in general.
Tuning parameters are often added to an engine in response to performance problems in particular configurations. This is not necessarily a bad thing because it provides DBA’s with the tools they need.
My goal for PBXT in this regard is twofold:
Morgan notes: There are more in InnoDB/XtraDB now than there were three years ago. This is probably something that emerges over time as we get to understand more about an engine.