This Week in Data with Colin Charles 27: Percona Live Tutorials Released and a Comprehensive Review of the FOSDEM MySQL DevRoom

Colin Charles
Join Percona Chief Evangelist Colin Charles as he covers happenings, gives pointers and provides musings on the open source database community.
Percona Live Santa Clara 2018 update: tutorials have been announced. The committee rated over 300 talks, and easily 70% of the schedule should go live next week as well. In practice, then, you should see about 50 talks announced next week. There’s been great competition: we only have 70 slots in total, so roughly 1 in 5 talks gets picked — talk about a competitive ratio.
FOSDEM was truly awesome last week. From a Percona standpoint, we had a lot of excellent booth traffic (being outside the PostgreSQL room on Saturday, and not too far from the MySQL room on Sunday). We gave away bottle openers (useful in Brussels with all the beer; we tried a new design with a magnet so you can attach it to your fridge), stickers, and some brochures, but most of all we had plenty of great conversations. There was quite a crowd from Percona, and it was excellent to see the MySQL & Friends DevRoom almost constantly full! A few of us brave souls managed to stay there the whole day, with barely any breaks, so as to enjoy all the talks.
I found the quality of the talks extremely high. When it comes to a community-run event, with all content picked by an independent program committee, FOSDEM really sets the bar high. There is plenty of competition to get a good talk in, and I enjoyed everything we picked (yes, I was on the committee too). We’ve had plenty of events in the ecosystem that had “MySQL” or related days, but FOSDEM might be the only one that has really survived. I understand we will have a day of some sort at SCALE16x, but even that has been scaled down. So if you care about the MySQL ecosystem, you will really want to ensure that you are at FOSDEM next year.
This year, we started with the usual MySQL Day on Friday. I could not be present, as I was at the CentOS Dojo giving a presentation. So, the highlight of Friday for me? The community dinner. Over 80 people showed up, I know there was a waiting list, and lots of people were trying to get tickets at the last minute. Many missed out too; sorry, better luck next year. Hopefully we will get a larger venue going forward. I really thank the organizers for this — we affectionately refer to them as the Belconians (i.e. a handful of Perconians based in Belgium). The conversation, the food, the drink — they were all excellent. It’s good to see representation from all parts of the community: MySQL, Percona, MariaDB, Pythian, and others. So thank you again, Liz, Dimitri, Tom, and Kenny in absentia. I think Tjerk also deserves special mention for always helping (this year with the drinks).
As for FOSDEM itself, beyond the booth, I think the most interesting stuff was the talks. There are video recordings and slides of pretty much all talks, but I will also give you the “Cliff’s Notes” of them here.
MySQL DevRoom talk quick summaries
Beyond WHERE and GROUP BY – Sergei Golubchik
- EXCEPT is in MariaDB Server 10.3
- recursive CTEs are good for hierarchical data, graphs, and data generation, and they are Turing complete (you can even use them to solve Sudoku)
- non-recursive CTEs can be an alternative syntax for subqueries in the FROM clause
- Window functions:
  - Normal: one result per row, depending on that row only
  - Aggregate: one result per group, depending on the whole group
  - Window: one result per row, depending on the whole group
- System versioned tables with AS OF
- Aggregate stored functions
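The distinction between aggregate and window results above can be sketched with a small example (the `sales` table and its columns are made up for illustration):

```sql
-- Hypothetical table: sales(region, amount)

-- Aggregate function: one result per group
SELECT region, SUM(amount) AS region_total
FROM sales
GROUP BY region;

-- Window function: one result per row, still computed over the whole group
SELECT region, amount,
       SUM(amount) OVER (PARTITION BY region) AS region_total
FROM sales;

-- Non-recursive CTE as alternative syntax for a subquery in the FROM clause
WITH region_totals AS (
  SELECT region, SUM(amount) AS total FROM sales GROUP BY region
)
SELECT * FROM region_totals WHERE total > 1000;
```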
MySQL 8.0 Performance: InnoDB Re-Design – Dimitri Kravtchuk
- Contention-Aware Transaction Scheduling (CATS), since 8.0.3. Not all transactions are equal; FIFO may not be optimal, so unblock the most blocking transactions first
- CATS (VATS) had a few issues, and there were bugs (they thought everything worked since MariaDB Server had implemented it). They spent about 9 months before fixing everything.
- Where does CATS help? Workloads hitting row lock contention. You can monitor via SHOW ENGINE INNODB MUTEX.
- The main difference comes down to REPEATABLE READ versus READ COMMITTED transaction isolation on the same workload. You really need to understand your workload when it comes to CATS/VATS.
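As a quick sanity check for the row lock contention CATS targets, the standard MySQL statements below are a rough sketch of what to look at:

```sql
-- Mutex/lock wait information mentioned in the talk
SHOW ENGINE INNODB MUTEX;

-- Row-lock wait counters: high wait counts/times suggest the kind of
-- contention CATS is designed to help with
SHOW GLOBAL STATUS LIKE 'Innodb_row_lock%';
```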
MySQL 8.0 Roles – Giuseppe Maxia
- Created like a user, granted like privileges. You need to activate them to use them.
- Before roles, you created a user, then grant, grant, and more grants… Add another user? Same deal. Lots of repetitive work and a lot of chances to make mistakes.
- Faster user administration – define a role, assign it many times. Centralized grant handling – grant and revoke privileges to roles, add/edit all user profiles.
- You need to remember to set the default role.
- A user can have many roles; default role can be a list of roles.
- Roles are users without a login – roles are saved in user tables. This is useful from an account lock/unlock perspective.
- You can grant a user to a user
- SET ROLE is for session management; SET DEFAULT ROLE is a permanent assignment of a role for a user. SET ROLE DEFAULT means assign the default role for this user for this session
- The role_edges table reports which roles are assigned to which users. default_roles keeps track of the current default roles assigned to users. A default role may not exist.
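Putting the points above together, a minimal roles workflow might look like this (the role, user, schema, and password values are hypothetical):

```sql
-- Define a role once and grant privileges to it
CREATE ROLE app_read;
GRANT SELECT ON app_db.* TO app_read;

-- Grant the role to users, and remember to set the default role
CREATE USER 'alice'@'%' IDENTIFIED BY 'example_password';
GRANT app_read TO 'alice'@'%';
SET DEFAULT ROLE app_read TO 'alice'@'%';

-- Within a session: activate a role, or fall back to the default
SET ROLE app_read;
SET ROLE DEFAULT;

-- Inspect role assignments
SELECT * FROM mysql.role_edges;
SELECT * FROM mysql.default_roles;
```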
Histogram support in MySQL 8.0 – Øystein Grøvlen
- You can now do ANALYZE TABLE tbl UPDATE HISTOGRAM ON col WITH n BUCKETS;
- New storage engine API for sampling (default implementation is full table scan even when sampling)
- Histogram is stored in a JSON column in the data dictionary. Grab this from the INFORMATION_SCHEMA.
- Histograms are useful for columns that are not the first column of any index, and used in WHERE conditions of JOIN queries, queries with IN-subqueries, ORDER BY … LIMIT queries. Best fit: low cardinality columns (e.g. gender, orderStatus, dayOfWeek, enums), columns with uneven distribution (skew), stable distribution (do not change much over time)
- How many buckets? For equi-height histograms, 100 buckets should be enough.
- Histograms are stored in the data dictionary, so will persist over restarts of course.
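A sketch of the histogram workflow described above (the schema, table, and column names are made up):

```sql
-- Build a 100-bucket equi-height histogram on a low-cardinality column
ANALYZE TABLE orders UPDATE HISTOGRAM ON status WITH 100 BUCKETS;

-- The histogram is stored as JSON and visible via the INFORMATION_SCHEMA
SELECT HISTOGRAM
FROM INFORMATION_SCHEMA.COLUMN_STATISTICS
WHERE SCHEMA_NAME = 'shop' AND TABLE_NAME = 'orders' AND COLUMN_NAME = 'status';

-- Drop it when no longer needed
ANALYZE TABLE orders DROP HISTOGRAM ON status;
```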
Let’s talk database optimizers – Vicențiu Ciorbaru
- Goal: produce a query plan that executes your query in the fastest time possible.
- Condition pushdown through PARTITION BY, there is also a comparison with PostgreSQL
- Split grouping for derived optimization only in MariaDB 10.3
- Read more at splitgroupingderived=on — https://github.com/MariaDB/server/commit/b14e2b044b (and the equivalent MDEV: https://jira.mariadb.org/browse/MDEV-13369)
TLS for MySQL at Large Scale – Jaime Crespo
- Literally took 3 lines in the my.cnf to turn on TLS
- They wanted to do a data centre failover and wanted to ensure replication would be encrypted.
- They didn’t have proper orchestration in place (MySQL could have this too). Every time OpenSSL or MySQL had to be upgraded, the daemon needed restarting. If there was an incompatible change, you had to sync master/replicas too.
- The automation and orchestration that Wikipedia uses: https://fosdem.org/2018/schedule/event/cumin_automation/ (it is called Cumin: https://wikitech.wikimedia.org/wiki/Cumin)
- Server-side OpenSSL support was poor, so they had to deploy wmf-mysql and wmf-mariadb builds of their own
- Currently using MariaDB 10.0, and looking to migrate to MariaDB 10.1
- They’ve had plenty of client library pain
- TLSv1.2 from the beginning (2015).
- Connecting is 20-50x slower; the impact on actual query performance is less than 5%. Just fix the client libraries to use persistent connections. They are now very interested in ProxySQL for this purpose.
- Monty asks, would a double certificate help? Jaime says sure. But he may not actually use double certificates; might not solve CA issues, and the goal is not to restart the server.
- Monty wonders why not upgrade to 10.2? “Let’s talk outside because it’s a much larger question.”
MySQL InnoDB Cluster – Miguel Araújo
- group replication: update everywhere (multi-master), virtually synchronous replication, automatic server failover, distributed recovery, group reconfiguration, GCS (implementation of Paxos – group communication system). HA is a critical factor.
- mysqlsh: interactive and batch operations. Document store (CRUD and relational access)
- Usability. HA out of the box.
- It’s easy to join a new node; new node goes into recovery mode (and as long as you have all the binary logs, this is easy; otherwise start from a backup)
- SET PERSIST – run a command remotely, and the configuration is persisted in the server
- Network flapping? Group replication will just evict a node from the cluster if it’s flapping too often
Why we’re excited about MySQL 8 – Peter Zaitsev
- Native data dictionary – atomic, crash safe, DDLs, no more MyISAM system table requirements
- Fast INFORMATION_SCHEMA
- utf8mb4 as default character set
- Security: roles, breakdown of SUPER privileges, password history, faster cached-SHA2 authentication (default), builds using OpenSSL (like Percona Server), skip grants blocks remote connections, logs now encrypted when tablespace encryption enabled
- Persistent AUTO_INCREMENT
- auto-managed undo tablespaces – undo space no longer uses the system tablespace, and disk space is automatically reclaimed.
- Self-tuning, limited to InnoDB (innodb_dedicated_server to auto-tune)
- partial in-place update for JSON – update a field in a JSON object without a full rewrite. Good for counters/statuses/timestamps. Update/removal of an element is supported
- Invisible indexes – test impact of dropping indexes before actually dropping them. Maintained but unused by the optimizer. If not needed or used, then drop away.
- TmpTable Storage Engine – more efficient storage engine for internal temporary tables. Efficient storage for VARCHAR and VARBINARY columns. Good for GROUP BY queries. Doesn’t support BLOB/TEXT columns yet (this reverts to InnoDB temp table now)
- Backup locks – prevent operations which may result in inconsistent backups. LOCK INSTANCE FOR BACKUP (something Percona Server has had before)
- Optimizer histograms – detailed statistics on columns, not just indexes
- improved cost model for the optimizer – www.unofficialmysqlguide.com
- Performance Schema – faster (via “fake” indexes), error instrumentation, response time histograms (global & per query), digest summaries
- SELECT * FROM sys.session – fast potential replacement for SHOW PROCESSLIST
- RESTART (command)
- SET PERSIST – e.g. change the buffer pool size, and this helps during a restart
- assumes default storage is SSD now
- binary log on by default, log_slave_updates enabled by default, and log expires after 30 days by default
- query cache removed. Look at ProxySQL or some other caching solution
- native partitioning only – remove partitions from MyISAM or convert to InnoDB
- resource groups – isolation and better performance (map queries to specific CPU cores; can jail your costly queries, like analytical queries)
- Feature requests: better single-thread performance, parallel query support (currently absent)
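A few of the features above in SQL form, as a rough sketch (the table and index names are hypothetical):

```sql
-- SET PERSIST: the change survives a restart (written to mysqld-auto.cnf)
SET PERSIST max_connections = 500;

-- Invisible index: still maintained, but ignored by the optimizer
ALTER TABLE t1 ALTER INDEX idx_col1 INVISIBLE;
-- ...observe query performance; if nothing regresses, drop it for real:
ALTER TABLE t1 DROP INDEX idx_col1;

-- Restart the server from SQL
RESTART;
```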
MySQL Test Framework for Support and Bugs Work – Sveta Smirnova
- MTR allows you to add multiple connections
- has commands for flow control
ProxySQL – GTID Consistent Reads – René Cannaò, Nick Vyzas
- The threshold is configurable in increments of 1 second. Replication lag can be monitored with ProxySQL; you want to ensure you don’t have stale reads.
- Why is GTID important? To guarantee consistency. Auto-positioning for restructuring topologies.
- --session-track-gtids is an important feature which allows sending the GTID for a transaction in the OK packet. Not available in MariaDB.
- There is a ProxySQL Binlog Reader now – it streams GTID information about a MySQL server to all connected ProxySQL instances. It is a lightweight process to run on your MySQL server.
- ProxySQL can be configured to enforce GTID consistency for reads on any hostgroup/replication hostgroup.
- Live demo by René
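The session-track-gtids behaviour mentioned above can be enabled per session or globally; as a sketch (valid values are OFF, OWN_GTID, and ALL_GTIDS):

```sql
-- Have the server return the GTID of each transaction in the OK packet,
-- which is what ProxySQL uses for GTID-consistent reads
SET SESSION session_track_gtids = OWN_GTID;

-- GTID-based routing assumes GTID mode is on
SELECT @@gtid_mode, @@enforce_gtid_consistency;
```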
Turbocharging MySQL with Vitess – Sugu Sougoumarane
- trend for the cloud: container instances, short-lived containers, tolerate neighbors, discoverability. No good tools yet for Kubernetes.
- non-ideal options: application-level sharding, NoSQL, paid solutions, NewSQL (CockroachDB, TiDB, Yugabyte)
- Vitess: leverage MySQL at massive scale, opensource, 8+ years of work, and multiple production examples
- Square uses Vitess for Square Cash application.
- Can MySQL run on Docker? Absolutely, many of the companies do huge QPS on Docker.
- YouTube does a major re-shard every 2-3 months. No one notices nowadays when that happens.
- app server connects to vtgate, and only underneath it’s a bunch of smaller databases with vttablet + mysqld. The lockserver is what makes it run well in the cloud.
- pluggable architecture with no compromise on performance: monitoring, health check, ACLs, tracing, more.
- at most, it adds about 2ms overhead to connections
- Go coding standards are enforced, unit tests with strict coverage requirements, end-to-end tests, Travis, CodeClimate and Netlify. Readability is king.
- On February 5, 2018, it will become a CNCF project, after one year of due diligence. They said there was nothing to compare it with; they looked at maturity and contributors. It’s becoming a truly community-owned project! (The “CNCF to Host Vitess” announcement is already live.)
- roadmap: full cross-shard queries, migration tools, simplify configurability, documentation.
- full MySQL protocol, but a limited query set – they want to get it to a point where it accepts a full MySQL query.
Orchestrator on Raft – Shlomi Noach
- Raft: guaranteed to be in-order replication log, an increasing index. This is how nodes choose a leader based on who has the higher index. Get periodic snapshots (node runs a full backup).
- HashiCorp raft, a Golang raft implementation, used by Consul
- orchestrator manages topology for HA topologies; also want orchestrator to be highly available. Now with orchestrator/raft, remove the MySQL backend dependency, and you can have data center fencing too. Now you get: better cross-DC deploys, DC-local KV control, and also Kubernetes friendly.
- n-orchestrator nodes, each node still runs its own backend (either MySQL or SQLite). Orchestrator provides the communication for SQLite between the nodes. Only one (the Raft leader) will handle failovers
- implementation & deployment at GitHub – one node per DC (deployed at 3 different DCs). 1-second raft polling interval. 2 major DCs, one in the cloud. Step-down, raft-yield, SQLite-backed log store, and still a MySQL backend (the SQLite backend use case is in the works)
- They patched the HashiCorp raft library. The library doesn’t care about the identity of nodes, but GitHub does want to control the identity of the leader. There is an “active” data center, and locality is important. This is what they mean by raft-yield (picking a candidate leader).
- The ability for a leader to step down is also something they had to patch.
- HashiCorp Raft only supports LMDB and another database, so the replication log is now kept in a relational SQLite backed log store. Another patch.
- Once orchestrator can’t pass its own health check, it recognizes this, and the application tells raft that it’s stepping down. Stepping down takes 5 seconds, and raft then promotes another orchestrator node to be the leader. This is their patch.
- can also grab leadership
- DC fencing handles network partitioning.
- orchestrator is Consul-aware. Upon failover, orchestrator updates Consul KV with the identity of the promoted master.
- considerations to watch out for: what happens if, upon replay of the Raft log, you hit two failovers for the same cluster? NOW() and otherwise time-based assumptions. Reapplying snapshot/log upon startup
- roadmap: use Kubernetes (cluster IP based configuration in progress, already container friendly via auto-re-provisioning of nodes via Raft)
MyRocks Roadmaps – Yoshinori Matsunobu
- Facebook has a large User Database (UDB). Social graph, massively sharded, low latency, automated operations, pure flash storage (constrained by space, not CPU/IOPS)
- They have a record cache in front of MySQL – TAO, for reads. If the cache misses, it hits the database; all write requests go through MySQL. UDB has to be fast to ensure a good user experience.
- Facebook also runs 2 instances of MySQL on the same machine, because CPU usage wasn’t huge, but the space savings were awesome.
- design decisions: clustered index (same as InnoDB), slower for reads, faster for writes (bloom filters, column family), support for transactions including consistency between binlog and MyRocks. Faster data loading/deletes/replication, dynamic options (instead of having to restart mysqld), TTL (comparable to HBase TTL feature, specify the TTL, any data older than time, can be removed), online logical (for recovery purposes) & binary backup (for creating replicas)
- Pros: smaller space, better cache hit rate, writes are faster so you get faster replication, much smaller bytes written
- Cons: no statement based replication, GAP locks, foreign keys, full-text index, spatial index support. Need to use case sensitive collations for performance. Reads are slower, especially if the data fits in memory. Dependent on file system and OS; lack of solid direct I/O (uses buffered I/O). You need a newer than 4.6 kernel. Too many tuning options beyond buffer pool such as bloom filter, compactions, etc.
- Completed InnoDB to MyRocks migration. Saved 50% space in UDB compared to compressed InnoDB.
- Roadmaps: getting in MariaDB and Percona Server for MySQL. Read Mark’s blog for matching read performance vs InnoDB. Supporting mixed engines. Better replication and bigger instance sizes.
- mixed engines: InnoDB and MyRocks on the same instance, though a single transaction cannot span engines. Plan to extend xtrabackup to integrate `myrocks_hotbackup`. Backport gtid_pos_auto_engines from MariaDB?
- Removing the engine log: overhead comes from having both a binlog and an engine log, which requires 2PC and ordered commits. Use one log? Either the binlog, a binlog-like service, or the RocksDB WAL? They rely on the binlog now (semi-sync, binlog consumers), and need to determine how much performance is gained by stopping writes to the WAL.
- Parallel replication apply is important in MySQL 8
- support bigger instance sizes: shared nothing database is not a general purpose database. Today you can get 256GB+ RAM and 10TB+ flash on commodity servers. Why not run one big instance and put everything there? Bigger instances may help general purpose small-mid applications. Then you don’t have to worry about sharing. Atomic transactions, joins and secondary keys will just work. Amazon Aurora today supports a 60TB instance!
- today: you can start deploying slaves with consistency check. Many status counters for instance monitoring.
ProxySQL internals – René Cannaò
- reduce latency, scale, maximize throughput. A single instance can handle hundreds of thousands of connections and thousands of backend servers.
- threading models: one thread per connection (blocking I/O), thread pooling (non-blocking I/O, scalable).
- ProxySQL thread pool implementation: known as “MySQL threads”; a fixed number of worker threads (configurable); all threads listen on the same port(s); client connections are not shared between threads; all threads perform their own network I/O; and it uses poll() (does that scale? there is a reason why they chose poll() over epoll())
- Threads never share client connections – no need for synchronization, thread contention is reduced, and each thread calls poll(). A possible con is imbalanced load (one thread having way more connections than another). Is it really a problem? Most of the time, no; connections will automatically balance.
- poll() is O(N), epoll() is O(1). poll() is faster than epoll() for fewer connections (around 1,000), but performance degrades when there are a lot of connections. So by default it uses poll() instead of epoll(); at around 50,000 connections performance degrades badly, so ProxySQL has auxiliary threads.
- MySQL_Session() is implemented as a state machine. Stores metadata associated with the client session (running timers, default hostgroup, etc.)
MySQL Point-in-time recovery like a rockstar – Frederic Descamps
- to do PITR, you need a decent backup!
- manual tells you about PITR – read it well
Releases
- Percona Monitoring and Management 1.7.0 (PMM) – This release features improved support for external services, which enables a PMM Server to store and display metrics for any available Prometheus exporter. For example, you could deploy the postgres_exporter and use PMM’s external services feature to store PostgreSQL metrics in PMM. Immediately, you’ll see these new metrics in the Advanced Data Exploration dashboard. Then you could leverage many of the pre-developed PostgreSQL dashboards available on Grafana.com, and with a minimal amount of edits have a working PostgreSQL dashboard in PMM!
- MariaDB Server 10.1.31 – usual updates to storage engines, and a handful of bug fixes.
Link Love
- Amazon Redshift Spectrum – are you using this yet?
- Database sharding explained in plain English
- Modern SQL Window Function Questions – a quiz that’s quite poignant (MariaDB 10.2+ has window functions; upcoming MySQL 8 has it too).
Upcoming Appearances
- SCALE16x – Pasadena, California, USA – March 8-11 2018
I look forward to feedback/tips via e-mail at firstname.lastname@example.org or on Twitter @bytebot.