October 21, 2014

Working with many files and file system fragmentation

Working on performance optimization project (not directly MySQL related) we did a test – creating 100 files writing 4K in the random file for some time and when checking the read speed on the files we end up with, compared to just writing the file sequentially and reading it back.

The performance difference was huge – we could read single file at 80MB/sec while fragmented files only deliver about 2MB/sec – this is a massive difference.

The test was done on EXT3 and it looks like it does not do very good job preventing file fragmentation for large amount of growing files.

It would be interesting to see how other filesystems deal with this problem, for example XFS with delayed allocation may be doing better job.

I also would like to repeat the test with MySQL MyISAM tables and see how bad the difference would be for MySQL but I would expect something along those lines.

Interesting enough it should not be that hard to fix this issue – one could optionally preallocate MyISAM tables in some chunks (say 1MB) so its gets less fragmentation. Though it would be interesting to benchmark how much such approach would generally help.

Until we have this feature – reduced fragmentation is one more benefit we get with batching. For example instead of inserting rows one by one in large number of tables once can be buffered in memory (application or MyISAM memory table) and flushed to the actual tables in bulks.

About Peter Zaitsev

Peter managed the High Performance Group within MySQL until 2006, when he founded Percona. Peter has a Master's Degree in Computer Science and is an expert in database kernels, computer hardware, and application scaling.

Comments

  1. I’ve been working with xfs a lot on the storage servers and fragmentation has been a huge problem with it too, though xfs prides itself in not needing defragmentation (at least as much as some other filesystems). Until fragmentation was sorted out, the load on the box would spike to 100+ during an otherwise reasonable load. I’ve used xfs_fsr to defrag and ended up cronning it to run for 30 seconds every 15 minutes. I suspect the issue may be related to nfs3 writing to it in relatively small chunks (as it’s a stateless protocol) and not xfs itself.

    As far as the benchmarks above, I’m curious to see a similar one with reiser, if you find time and ext4 when it comes out.

  2. Bill says:

    It would also be interesting to repeat the test with the files on those new solid state drives, to see how much of an improvement you could get.

  3. peter says:

    Oh yes. SSD reduce seek penalty dramatically so fragmentation should be much less issue for them.

    This is multiple Terabyte data storage project so it is not for SSD yet.

  4. brian says:

    Are you sure the speed difference is caused by fragmentation of the filesystem on the disk, and not due to the single large file being in the buffer cache? It might be interesting to use the following to see what’s in the buffer cache:

    http://net.doit.wisc.edu/~plonka/fincore/

    http://insights.oetiker.ch/linux/fadvise.html

  5. tgabi says:

    I’ve noticed this long time ago. I’m fighting this using different measures: a. small tables are optimized periodically b.big tables that are affected by fragmentation can be arranged as 1 file per filesystem c. recently Mysql 5.1 table partitioning allowed to use SSD for most recent data (most used) and disk for historical data (less used). Index files are the most prone to fragmentation – since 1 record inserted/updated can trigger updates on several indices. No filesystem is immune to fragmentation unfortunately, the only way to reduce it is to have some table options that will pre-allocate space in large chunks (like 1GB data, 1 GB index).

  6. peter says:

    Thanks tgabi,

    If you have updates you may be suffering from internal fragmentation in addition to file system fragmentation – such as rows may get more than one piece in MySQL sequential pages in index order happen to be in different locations etc.

    Good you agree on pre allocation option though I think even much smaller values such as 1MB or 4MB would make things much better.

  7. peter says:

    Brian,

    Thanks for heads up I’ve written another blog post on those tools.

    I’m quite sure that was not caching because number at VMSTAT matches these quite closely.

  8. Markus says:

    Be aware if you use XFS/ReiserFS on PC-Hardware, you risk loss of data in case of power failure:

    http://zork.net/~nick/mail/why-reiserfs-is-teh-sukc

  9. Some time ago (2006) I noticed that even on a fully defragged NTFS drive newly created InnoDB data files got scattered around the whole partition.
    http://jroller.com/dschneller/entry/strange_innodb_fragmentation

  10. peter says:

    Daniel,

    I’m not quite sure from your post – did you create the tablespace out of many files and it was already badly fragmented or it was autoextend during some workload run ?

    In any case we can learn filesystems may not be as optimal as we would like them to be :)

Speak Your Mind

*