This week I worked with a customer during a maintenance window that involved copying a lot of data between MySQL boxes. We had prepared well: we had measured how fast we could copy data between servers of this kind connected to the same network, and we had done the same thing before. Using a simple tar+netcat based copy you can get 80-90MB/sec on 1GigE, assuming the RAID is powerful enough. This applies to large Innodb tables with a not overly fragmented tablespace; otherwise it is easy to become IO bound rather than network bound.
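For reference, a tar+netcat copy is just two pipelines, one on each box. The port number, paths, and host name below are placeholders, and netcat flag syntax varies between the GNU and BSD variants; the local round-trip at the end exercises the same tar streaming pattern without a network:

```shell
# On the target server: listen on a port and unpack the incoming stream.
# (port 7777, /var/lib/mysql and target-host are placeholders)
#   nc -l -p 7777 | tar xf - -C /var/lib/mysql
#
# On the source server: stream the datadir straight into netcat.
#   tar cf - -C /var/lib/mysql . | nc target-host 7777
#
# The same tar pipeline can be checked locally with a plain pipe:
src=$(mktemp -d); dst=$(mktemp -d)
echo "ibdata" > "$src/ibdata1"
tar cf - -C "$src" . | tar xf - -C "$dst"
ls "$dst"
```

The point of streaming tar rather than scp or rsync is that there is no per-file handshake and no encryption overhead, so a single sequential stream can saturate the link if the disks keep up.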
As I mentioned, you can do even better using fast compression such as LZO or QuickLZ, but there was no need for it in this case.
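Adding a fast compressor is just one more stage in the same pipeline. A sketch, with lzop shown for LZO and gzip -1 used as a widely available stand-in in the runnable part (hosts, ports and paths are again placeholders):

```shell
# With a fast compressor in the stream (lzop for LZO):
#   source: tar cf - -C /var/lib/mysql . | lzop | nc target-host 7777
#   target: nc -l -p 7777 | lzop -d | tar xf - -C /var/lib/mysql
#
# Local round-trip of a compressed pipeline, using gzip -1 as a
# stand-in where lzop is not installed:
src=$(mktemp -d); dst=$(mktemp -d)
echo "page data" > "$src/table.ibd"
tar cf - -C "$src" . | gzip -1 | gzip -d | tar xf - -C "$dst"
cat "$dst/table.ibd"
```

Compression only helps when the network, not the CPU, is the bottleneck; a heavyweight compressor on a fast link can easily make the copy slower.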
So the estimates looked great, but once we started the real copy we saw a speed of about 20MB/sec instead of the projected 80MB/sec. IO and CPU usage on both the source and target servers were low, so it had to be the network, even though there was no other traffic between those two servers.
The mystery was easily resolved by looking at the network topology: some database servers were connected to Switch A and others to Switch B, with only a 1Gbit link between the two.
During the maintenance window, multiple tasks involving different servers made this inter-switch link the bottleneck.
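The arithmetic of a shared link is simple and worth doing up front. The numbers below are illustrative, not measurements from this incident: a 1Gbit/s link carries roughly 110MB/s of usable payload, and N concurrent copies each see about 1/N of that:

```shell
# Back-of-envelope for a shared inter-switch link.
# 110 MB/s usable payload and 4 concurrent streams are assumed
# illustrative numbers, not measured values.
link_mb=110
streams=4
awk -v l="$link_mb" -v n="$streams" \
  'BEGIN { printf "%.1f MB/s per copy\n", l/n }'
```

With four or five copies crossing the same 1Gbit link at once, per-copy throughput in the 20-30MB/sec range is exactly what this math predicts.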
What does this tell us? Even if you’re the DBA, you had better understand the network topology so you know what performance, availability, and failure scenarios to expect. If your network is complicated, it is at least worth knowing the numbers.
This does not only apply to the network but to any resource. For example, what if you have a catastrophic event and now need to restore all 50 servers from backup… in parallel? Will your backup system be able to restore them in parallel efficiently? Will there be enough network bandwidth to pipe them all through? These and similar questions are what you should be asking yourself.
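The same back-of-envelope habit answers the restore question. All the figures below are hypothetical (data sizes and link speeds will differ in any real deployment); the shape of the calculation is what matters:

```shell
# Hypothetical scenario: 50 servers with 200 GB of data each, all
# restored in parallel over a single shared 10Gbit/s link to the
# backup system (~1100 MB/s usable payload).
servers=50
gb_each=200
link_mbs=1100
awk -v s="$servers" -v g="$gb_each" -v l="$link_mbs" \
  'BEGIN { total_mb = s * g * 1024; printf "%.1f hours\n", total_mb/l/3600 }'
```

If the answer comes out at several hours, that number belongs in your recovery-time planning, and it is much cheaper to discover it with awk than during an outage.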