Abstract: Algorithmic gains in storage engines have mostly settled and Moore’s law for CPU speed is bottoming out, but database performance still stands to increase ~10x thanks to continuing NAND Flash improvement, Optane, Flash-idiomatic SSDs (e.g. Open-Channel, Zoned Namespaces), KeyValue SSDs, etc. The downside is that each storage engine revolution has historically broken encapsulation/reuse and required DBMS developers to remake backup and restore, high availability, diagnostics, etc. This article aims to create a checklist of necessary features that should be in new db-accelerating drives before we buy into them.
Over the last two decades, the performance of databases has increased approximately 100-fold per server. Part of that ~x100 came from software improvements, but most of it is thanks to better hardware, especially the HDD -> Flash SSD upgrade.
There are more upcoming gifts from the silicon heavens that will soon improve database performance ~x10, and maybe even another x10 on top of that by the time they’re done.
- NAND Flash is still improving well, nearly two decades after it first took off. (I recall SSDs had resolved so many db performance problems by, say, 2013, that I stopped paying attention; they got about 10 times faster since then, whilst I wasn’t looking.) Plus there is Optane, as memory or disk.
- Improving mainboard buses: not just doubling bit rates again and again, but also allowing more decentralized processing and memory access.
- SSD drives that provide in-situ processing, offloading work from the CPU to the drive.
- A growing number of commercial solutions that aggregate storage and memory over low-latency networks.
- The next-next-gen persistent memory technologies coming after that (MRAM looks like the frontrunner so far).
These will have performance benefits for all sorts of software, but for databases let’s summarize it roughly as:
- A ~x10 general performance increase over the next couple of years (i.e. 2020 ~ 2022) for software that makes use of new storage, whether it be a new-style NVM drive or non-volatile memory in a DIMM.
- Drives with in-situ processing capabilities will eliminate most of the database stalls we currently hit under high write load due to LSM compaction, WiredTiger checkpoints, etc.
- NoSQL and NewSQL databases famously solved the web-scale size issue with sharding, but sharding may become irrelevant for 90%+ of enterprise-size users as fast NVM drives grow > 10 TB, and can furthermore be aggregated to PB sizes over low-latency networks.
- The next-next generation of non-volatile memory, the one that supersedes NAND Flash and current-gen Optane, looks as though it will be x10 faster again, i.e. ~x100 better latency compared to our current-day performance.
All the new storage devices above can implement a key-value datastore and, depending on a few other factors, can probably be wrapped as a storage engine for DBMSes like MySQL or MongoDB.
“Another victory just around the corner” you might think, “thank goodness current-generation DBMSes are built in the paradigm of code reuse and encapsulation”.
The Problem for the Database Industry
The evolutionary paths opening up now are quite numerous. The choices for the storage engine developers at the moment are:
- Streaming/Zoned Namespace SSDs vs. app-controlled FTL vs. KeyValue-SSD interfaces
- Option: add an FPGA (or ASIC) to do simple-but-compute-intensive near-data processing on the flash drive rather than going back and forth to the OS kernel.
- Competing kernel APIs (e.g. asynchronous rather than classic synchronous disk access)
- Competing bus protocols
- The fancy NIC option: Go directly to network (i.e. skipping kernel) with network-attached NVMe and DMA products to do asynchronous replication or other data-sync/backup functions.
- Create indexes only ephemerally in memory, or persist them.
- BTree or LSM or other datastructure for tables and indexes
- Transactions: support multi-statement transactions or just atomic, single-record updates.
- Security: a new API for ownership and access will be needed if drives move away from filesystem data objects.
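To make the first choice above concrete: a KeyValue-SSD replaces the block read/write interface with key-value commands handled on the drive itself. The sketch below is only an in-memory stand-in for the rough shape of such a command set (the method names loosely echo published KV-SSD command sets, but everything here is illustrative, not a real driver):

```python
class KVSSDSketch:
    """In-memory stand-in for the key-value command interface a KV-SSD
    exposes instead of block reads/writes. Hypothetical, for illustration."""

    def __init__(self):
        # A real drive's FTL maps keys to flash locations internally;
        # there are no files and no block offsets visible to the host.
        self._data = {}

    def store(self, key: bytes, value: bytes) -> None:
        self._data[key] = value

    def retrieve(self, key: bytes) -> bytes:
        return self._data[key]

    def delete(self, key: bytes) -> None:
        self._data.pop(key, None)

    def exist(self, key: bytes) -> bool:
        return key in self._data


drive = KVSSDSketch()
drive.store(b"user:42", b'{"name": "Ada"}')
assert drive.retrieve(b"user:42") == b'{"name": "Ada"}'
drive.delete(b"user:42")
assert not drive.exist(b"user:42")
```

Note how naturally this maps onto a database record store, and how thoroughly it sidesteps the filesystem, which is exactly what raises the access control question later in this article.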
And that is just now. When the next generation of persistent memory arrives it will all start again: a new userspace interface, new kernel drivers, new FPGAs/ASICs on the pluggable drive or memory, etc.
The storage solution providers will provide the above, so it might seem as though we just have to be patient for now and then, when the winning solutions arrive, retrain ourselves to learn new performance analysis and tuning techniques.
But with industry history as our guide, the new plug-in database storage engines delivered to market will only be complete for the single-server database. I.e. I predict they will ship the product out the door without the following.
Predicted Gap 1 – Hot Backup (+ replica initialization)
I think storage engine developers forget that, as a business-imposed rule, basically all database deployments have two external systems that are supposed to capture everything: backups and replica sets. Taking backups and keeping replica sets is how we user-proof and power-failure-proof the database.
To get the data out to external backups or replicas efficiently, the new storage engines should provide:
- An API to read a consistent snapshot to make a backup / initial copy for replica sync.
- An API to serve a cursor, or some kind of read pointer, that fetches the incremental write-operations log from an arbitrary time in the past (typically the last backup snapshot time or the replica initialization start time) up to another arbitrary point in time.
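A minimal sketch of these two calls, assuming an LSN-style (log sequence number) ordering of writes. All names here are hypothetical; the point is only the shape of the contract, not a proposed API:

```python
import itertools

class EngineBackupAPI:
    """Sketch of the two backup-facing calls: a consistent snapshot
    and an incremental log cursor between two arbitrary points."""

    def __init__(self):
        self._log = []                     # append-only (lsn, op) pairs
        self._next_lsn = itertools.count(1)

    def apply(self, op):
        """Record a write; returns its log sequence number."""
        lsn = next(self._next_lsn)
        self._log.append((lsn, op))
        return lsn

    def snapshot(self):
        """The full-backup call: a consistent copy plus the LSN it is
        consistent at, so an incremental cursor can resume from there."""
        lsn = self._log[-1][0] if self._log else 0
        state = [op for _, op in self._log]  # stand-in for on-drive data
        return lsn, state

    def log_cursor(self, from_lsn, to_lsn=None):
        """The incremental call: yield ops after from_lsn, up to to_lsn."""
        for lsn, op in self._log:
            if lsn > from_lsn and (to_lsn is None or lsn <= to_lsn):
                yield lsn, op
```

A backup agent would take `snapshot()` once, remember the returned LSN, and thereafter stream `log_cursor(last_lsn)` for incremental backups; replica initialization follows the same two steps.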
Fast, consistent backup isn’t just a nice-to-have; higher write throughput and single-drive sizes > 10 TB will need to be accommodated by the time this technology becomes commonplace in datacenters. Slower, up-the-stack backup methods such as classic database table dumps won’t finish within reasonable times for the new middle-size, let alone big-size, databases.
The hot backups we have today are ‘bolt-on-after’ systems. They wasted a lot of developer and DBA time, and fragmentation between the alternatives continues to divide engineering hours and hence the progress the industry makes. So let’s make hot backup and efficient replication an initial requirement this time.
Predicted Gap 2 – Comprehensive Diagnostic Interfaces
Storage engines (e.g. InnoDB or WiredTiger) might be replaced by drives that do basically all the indexing, consistency guarantees, clean-up processing, etc. on the drive, without any execution path involving the db software on the CPU. The internal engine metrics that current storage engines expose should be matched with equivalents reporting activity inside these in-situ processing drives.
DBAs need them for monitoring, and engineers examining defects in the field are blind without them.
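What "parity" means in practice is mostly just operation counters plus latency sums per operation type, which monitoring agents can difference between scrapes to derive throughput and mean latency. A minimal sketch, with all names hypothetical:

```python
from collections import defaultdict

class DriveDiagnostics:
    """The minimum diagnostic surface argued for above: per-op-type
    counters and latency sums. Names are hypothetical, not a real API."""

    def __init__(self):
        self.counters = defaultdict(int)        # ops completed per type
        self.latency_sum_ns = defaultdict(int)  # total time spent per type

    def record(self, op_type: str, latency_ns: int) -> None:
        # In-situ processing on the drive (compaction, index maintenance,
        # checkpointing) would feed this the same way reads/writes do.
        self.counters[op_type] += 1
        self.latency_sum_ns[op_type] += latency_ns

    def report(self) -> dict:
        # Counter + sum pairs are enough: an agent subtracts two scrapes
        # to get throughput and mean latency over the interval.
        return {
            op: {
                "count": self.counters[op],
                "avg_latency_us": self.latency_sum_ns[op] / self.counters[op] / 1000,
            }
            for op in self.counters
        }
```

Anything less than this, exposed per in-drive activity (compaction, garbage collection, wear-leveling), leaves both DBAs and field engineers guessing.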
Predicted Gap 3 – Access Control
With the existing file-based databases the “mongod” or “mysqld” or “postgres” etc. user owns the database files: “chown -R dbuser:dbgroup /data/db; chmod 660 /data/db/files*”. Unless you subvert it, accidentally or deliberately, other (non-root) processes won’t have access to the db records in those files.
Some upcoming solutions, though, will provide a key-value API via NVMe commands without any filesystem at all.
How does access control work when there are no files? How does, say, a KeyValue SSD discriminate between user processes?
So far, as far as my reading goes, consensus-building on that hasn’t started in earnest yet.
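One possible shape, sketched purely as a thought experiment: the drive (or its kernel driver) partitions the key space into namespaces, each with an owner credential checked on every command, replacing what chown/chmod did for files. Nothing here is a proposed standard; every name is hypothetical:

```python
class KVNamespaceACL:
    """Hypothetical sketch: per-namespace ownership enforced on each
    key-value command, standing in for filesystem chown/chmod."""

    def __init__(self):
        self._owners = {}  # namespace -> owning principal

    def create_namespace(self, ns: str, owner: str) -> None:
        self._owners[ns] = owner

    def check(self, ns: str, caller: str) -> bool:
        # A real drive/driver would run this check on every store or
        # retrieve command, since no filesystem is there to do it.
        return self._owners.get(ns) == caller


acl = KVNamespaceACL()
acl.create_namespace("db0", "mongod")
assert acl.check("db0", "mongod")
assert not acl.check("db0", "some-other-process")
```

Even a model this simple raises the open questions: who authenticates the caller to the drive, and where do credentials live when there is no OS user database in the path?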
I am quite eager to get started with the new NVM, whether an SSD drive or persistent memory in DIMMs, to have storage engines that will put pre-2020 state-of-the-art database performance to shame.
But at the same time, to avoid needless double-work across the database industry, I don’t think we should put these drives to use in our general-purpose DBMSes until they natively support all the following:
- An MVCC table structure that affords snapshot isolation
- Fast cloning of a consistent snapshot
- An incremental write log
- Near-data processing for LSM Compaction, BTree trimming, etc.
- A clear and simple access control model
- Diagnostic info (particularly counters and latency sums) on par with the detail we get from MySQL performance_schema or MongoDB’s serverStatus.
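The checklist above can be summed up as an interface contract to evaluate new drives against. This is a sketch of the shape, not an existing or proposed API; every name is hypothetical:

```python
from abc import ABC, abstractmethod
from typing import Iterator, Tuple

class NVMStorageEngine(ABC):
    """The article's checklist expressed as a minimal abstract contract.
    All method names are hypothetical illustrations."""

    @abstractmethod
    def begin_snapshot_txn(self) -> int:
        """MVCC: open a read view at a consistent snapshot; returns its id."""

    @abstractmethod
    def clone_snapshot(self, snapshot_id: int, dest: str) -> None:
        """Fast consistent clone, for hot backup and replica initialization."""

    @abstractmethod
    def log_cursor(self, from_lsn: int) -> Iterator[Tuple[int, bytes]]:
        """Incremental write log from an arbitrary past point."""

    @abstractmethod
    def schedule_maintenance(self) -> None:
        """Near-data processing: LSM compaction, BTree trimming, etc. on-drive."""

    @abstractmethod
    def grant(self, principal: str, namespace: str) -> None:
        """Access control without filesystem ownership."""

    @abstractmethod
    def diagnostics(self) -> dict:
        """Counters and latency sums on par with performance_schema."""
```

A drive (or its host-side library) that cannot fill in all six of these slots is, by the argument above, not yet ready to be a general-purpose DBMS storage engine.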