Table fragmentation and SSDs

  • Table fragmentation and SSDs

    Had an interesting thought today. First let me talk a bit about our infrastructure.

    We have a largish table with about 60GB of data. This data changes often and is reloaded from CSV files using LOAD DATA. We currently load the new dataset into a second table, then rename the two and delete the old one (a rough sketch of this swap appears below). This is a pain in the ass for a few reasons, but mostly because it means that suddenly none of our data is in memory, so MySQL has to rebuild the entire InnoDB buffer pool when we make this switch. This of course makes our app VERY slow to respond for a half hour or so.

    Our entire dataset fits nicely in memory, and after it is loaded, it is entirely read only. We only run selects against it.

    So, my question is: if we are using SSDs AND the entire dataset fits in memory, could we, instead of doing what we do now, load the new data over the current data so the records are updated in place, and then delete the records that are no longer used? The ONLY reason we don't do that now is that I didn't want the table to become fragmented, but with SSDs and the dataset kept in memory, fragmentation shouldn't matter. You don't defrag an SSD on a desktop system, for example, because it wears the drive out faster, and access time is the same no matter where on the disk you are reading from.

    Am I onto something here, or completely off base?
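
    A rough sketch of the reload-and-swap described above (the table name and file path are made up for illustration; the real schema isn't shown in the thread):

    -- Hypothetical names; adjust to the real schema and file path.
    CREATE TABLE products_new LIKE products;

    -- Bulk-load the fresh CSV export into the new copy.
    LOAD DATA INFILE '/data/products.csv'
    INTO TABLE products_new
    FIELDS TERMINATED BY ',' ENCLOSED BY '"'
    LINES TERMINATED BY '\n';

    -- Swap the tables atomically, then drop the old copy. After the swap the
    -- buffer pool no longer holds the hot pages, which is the warm-up penalty
    -- described above.
    RENAME TABLE products TO products_old, products_new TO products;
    DROP TABLE products_old;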

  • #2
    Have you proven that fragmentation is a bad thing?

    • #3
      I think it's generally accepted that table fragmentation is a bad thing, especially for spinning disks. They are only reasonably good at sequential I/O, so access times start to add up and slow things down. My question is basically whether anyone has experienced the same with in-memory tables and/or those stored on solid state drives, neither of which should be negatively impacted by random I/O (unless, of course, you are nearing the maximum I/O for that device).

      • #4
        Your data fits in memory, so spinning disks should not be part of your argument. I could imagine worse page fill, but that depends on your table and on what data is changed.

        • #5
          You have a shared but unexamined assumption that everyone "generally accepts" about fragmentation, which is based on a widespread lack of understanding about how complex data access really is; it's not as simple as random versus sequential. My conclusion is that you're working against yourself and against the strengths of the storage engine.

          • #6
            xaprb wrote on Mon, 12 March 2012 12:10
            You have a shared but unexamined assumption that everyone "generally accepts" about fragmentation, which is based on a widespread lack of understanding about how complex data access really is; it's not as simple as random versus sequential. My conclusion is that you're working against yourself and against the strengths of the storage engine.
            That's really the reason for the question.

            • #7
              Well, I'd answer it by measuring -- my experience has shown me time and again that things like this can't be guessed accurately. I mean, even without SSD, I think you are quite likely to improve performance and avoid the long downtime you mentioned if you simply update the data in place instead of the process you're currently using. I could be wrong, but again it's pretty hard to guess this.

              • #8
                Agreed, I need to run a test. I actually ran into an issue setting up the test anyway: we need our record ids to remain the same after an update. Of course, LOAD DATA only does a REPLACE, which changes the id of the record, and there doesn't appear to be a way to do an ON DUPLICATE KEY UPDATE.

                So, we might have to load the new data into a temp table, then merge the two manually? Unless someone can think of another way to do it.
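
                A minimal sketch of that temp-table merge, assuming (hypothetically) that the CSV carries a stable unique value, here called sku, which already has a UNIQUE index on the live table:

                -- Hypothetical table and column names; adjust to the real schema.
                CREATE TABLE products_staging LIKE products;

                LOAD DATA INFILE '/data/products.csv'
                INTO TABLE products_staging
                FIELDS TERMINATED BY ',' ENCLOSED BY '"'
                LINES TERMINATED BY '\n'
                (sku, name, price);

                -- Upsert into the live table, matching on the unique key (sku)
                -- so existing rows keep their auto_increment id.
                INSERT INTO products (sku, name, price)
                SELECT sku, name, price FROM products_staging
                ON DUPLICATE KEY UPDATE name = VALUES(name), price = VALUES(price);

                -- Remove rows that are no longer present in the new export.
                DELETE p FROM products p
                LEFT JOIN products_staging s ON s.sku = p.sku
                WHERE s.sku IS NULL;

                DROP TABLE products_staging;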

                • #9
                  I don't understand what you mean by "which changes the id of the record". What goes wrong when you use "LOAD DATA INFILE 'file_name' REPLACE"?

                  • #10
                    The table in question uses an auto-incrementing id as the primary key. When you do a REPLACE, it deletes the record and inserts a new one, thus changing the id. The id is used by our search engine for indexing, so if it changes, our search indexes become stale and serve bad data. We need to update the records without deleting them and creating new ones, and it doesn't appear that LOAD DATA can do that.
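
                    For illustration, a tiny example of that behavior with a made-up schema, where the duplicate is caught on a secondary UNIQUE key rather than on the auto_increment primary key:

                    -- Hypothetical schema, for illustration only.
                    CREATE TABLE t (
                        id    INT AUTO_INCREMENT PRIMARY KEY,
                        sku   VARCHAR(20) NOT NULL UNIQUE,
                        price DECIMAL(8,2)
                    );

                    INSERT INTO t (sku, price) VALUES ('A-100', 9.99);   -- this row gets id = 1

                    -- REPLACE deletes the conflicting row and inserts a fresh one,
                    -- so the same sku comes back with a new auto_increment id (2).
                    REPLACE INTO t (sku, price) VALUES ('A-100', 8.99);

                    -- INSERT ... ON DUPLICATE KEY UPDATE changes the row in place,
                    -- so whatever id the row currently has is preserved.
                    INSERT INTO t (sku, price) VALUES ('A-100', 7.99)
                        ON DUPLICATE KEY UPDATE price = VALUES(price);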

                    • #11
                      You will need something unique in every record; it seems unwise to rely on the assumption that no record will ever change position in, or be removed from, the CSV file.

                      • #12
                        We don't care if they change in the CSV; we're only concerned about the ids in the database itself.

                        • #13
                          1.
                          I know that you guys have already worked this out, but I just wanted to emphasize that if your data fits in memory, then the question of how it is stored on disk basically vanishes and you can live happily ever after without complicating things.

                          Prosperent wrote on Tue, 13 March 2012 17:30
                          We don't care if they change in the CSV; we're only concerned about the ids in the database itself.
                          2.
                          Why is that?
                          If you update a record, shouldn't the search engine be updated anyway, since the original record isn't there anymore?
                          And how do you handle records that are removed from the CSV today? I fail to see how LOAD DATA with the complete CSV file, as you do it today, can be relied on to keep the same id for the same record. You should never rely on an internal auto_increment PK to uniquely identify a record that has been imported from somewhere else.

                          3.
                          Why do you actually have such a large CSV file that is so volatile that you need to reload it all the time? Could you change the workflow so that the system that exports the file only exports changes, and then once in a while you get a complete export to make sure you are in sync?
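
                          If the CSV does carry a stable external identifier, one way to act on that advice is to store and key on it explicitly instead of the internal id (the column name below is made up):

                          -- Hypothetical column; use whatever stable identifier the export provides.
                          ALTER TABLE products ADD COLUMN external_ref VARCHAR(64) NULL;

                          -- ...backfill external_ref for the existing rows from the export...

                          -- Once populated, enforce uniqueness so reloads can match on it
                          -- (and the search engine can index external_ref instead of id).
                          ALTER TABLE products
                              MODIFY external_ref VARCHAR(64) NOT NULL,
                              ADD UNIQUE KEY uk_external_ref (external_ref);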
