After reading Baron’s prediction on databases, here:
I want to provide my own view on what’s coming up next for SQL databases. I think we live in interesting times, when we can see the beginning of the next-generation of RDBMSs.
There are defining characteristics of such databases:
- Auto-scaling. The ability to add and use resources depending on the current load and database size. This is done transparently for users and DBAs.
- Auto-healing. The automatic handling of node failures.
- Multi-regional, cloud-agnostic, geo-distributed. The ability to support multiple data centers and multiple clouds, in different parts of the world.
- Transactional. All the above, with the ability to support multi-statements transactional workloads.
- Strong consistency. The full definition of strong consistency is pretty involved. For simplicity, let’s say it means that reads (in the absence of ongoing writes) will return the same data, despite what region or data center you are getting it from. A simple counter-example is the famous MySQL asynchronous replication, where (with the slave delay) reading the data on a slave can return very outdated data. I am focusing on reads, because in a distributed environment the consistent reads performance will be affected. This is where network latency (often limited by the speed of light) will define performance.
- SQL language. SQL, despite being old and widely criticized, is not going anywhere. This is a universal language for app developers to access data.
With this, I see following interesting projects:
- Google Cloud Spanner (https://cloud.google.com/spanner/). Recently announced and still in the Beta stage. Definitely an interesting projects, with the obvious limitation of running only in Google Cloud.
- FaunaDB (https://fauna.com/). Also very recently announced, so it is hard to say how it performs. The major downside I see is that it does not provide SQL access, but uses a custom language.
- Two open source projects:
- CockroachDB (https://www.cockroachlabs.com/). This is still in the Beta stage, but definitely an interesting project to follow. Initially, the project planned to support only key-value access, but later they made a very smart decision to provide SQL access via a PostgreSQL-compatible protocol.
- TiDB (https://github.com/pingcap/tidb). Right now in RC stages, and the target is to provide SQL access over a MySQL compatible protocol (and later PostgreSQL protocol).
Protocol compatibility is a wise approach, although not strictly necessary. It lowers an entry barrier for the existing applications.
Both CockroachDB and TiDB, at the moment of this writing, still have rough edges and can’t be used in serious deployments (from my experience). I expect both projects will make a big progress in 2017.
What shared characteristics can we expect from these systems?
As I mentioned above, we may see that the read performance is degraded (as latency increases), and often it will be defined more by network performance than anything else. Storage IO and CPU cycles will be secondary factors. There will be more work on how to understand and tune the network traffic.
We may need to get used to the fact that point or small range selects become much slower. Right now, we see very fast point selects for traditional RDBM (MySQL, PostgreSQL, etc.).
Heavy writes will be problematic. The problem is that all writes will need to go through the consistency protocol. Write-optimized storage engines will help (both CockroachDB and TiDB use RocksDB in the storage layer).
The long transactions (let’s say changing 100000 or more rows) also will be problematic. There is just too much network round-trips and housekeeping work on each node, making long transactions an issue for distributed systems.
Another shared property (at least between CockroachDB and TiDB) is the active use of the Raft protocol to achieve consistency. So it will be important to understand how this protocol works to use it effectively. You can find a good overview of the Raft protocol here: http://container-solutions.com/raft-explained-part-1-the-consenus-problem/.
There probably are more NewSQL technologies than I have mentioned here, but I do not think any of them captured critical market- or mind-share. So we are at the beginning of interesting times . . .
What about MySQL? Can MySQL become the database that provides all these characteristics? It is possible, but I do not think it will happen anytime soon. MySQL would need to provide automatic sharding to do this, which will be very hard to implement given the current internal design. It may happen in the future, though it will require a lot of engineering efforts to make it work properly.