HUGE Tables and Indexes


  • HUGE Tables and Indexes

    I have been given the task of loading 6.5 billion records (yes, that is not a typo: six and a half billion records) into a MySQL database running on Ubuntu 8.04.

    I have successfully done so using:
    load data local infile 'FILENAME' into table TABLENAME FIELDS TERMINATED BY ',';

    This took around 10 hours.

    I used the following procedure to create the INDEXES:

    1. Create the data without indexes.
    2. Move (rename) the table.MYD file to a backup.
    3. Create an empty MYD file using TRUNCATE TABLE.
    4. Add the indexes.
    5. Move the backed-up MYD file back to table.MYD.
    6. Run myisamchk to repair/rebuild:
       myisamchk -q -r -O key_buffer=4096M -O sort_buffer=2048M -O read_buffer=640M -O write_buffer=640M

    This procedure took around 15 HOURS
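
    A standard alternative to the manual .MYD shuffle, with the same build-keys-by-sort effect, is MyISAM's built-in key toggling; a sketch only, reusing the placeholder table and file names from the post:

    ```sql
    -- Sketch: the same bulk-load pattern using MyISAM's own key handling.
    ALTER TABLE TABLENAME DISABLE KEYS;   -- defer non-unique index maintenance

    LOAD DATA LOCAL INFILE 'FILENAME'
      INTO TABLE TABLENAME
      FIELDS TERMINATED BY ',';

    -- Rebuild the non-unique indexes in one pass; with large sort buffers
    -- this does the same repair-by-sort work as the myisamchk step above.
    ALTER TABLE TABLENAME ENABLE KEYS;
    ```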

    Once complete, the file sizes were:
    MYD --> 410 GB
    MYI --> 200 GB

    Server spec:
    Dell 2950, dual quad-core 2.5 GHz
    32 GB RAM
    Data on 7 x 7,200 RPM striped, iSCSI-connected disks.

    At the end of the project I will have 10 times the amount of data to load.

    I would be interested to hear comments/ideas on ways to improve the load and index times, or any other suggestions for working with this quantity of data.



  • #2
    It may be possible, depending on the nature of the data, to shard it across multiple tables (and servers). Doing that should decrease indexing time, and perhaps insertion time slightly.

    Is there a reason you're using MyISAM? Or would you be able to look into other storage engines?


    • #3
      Yes, I could consider using another storage engine.
      What would you suggest?
      Can you elaborate a little on the sharding?


      • #4
        You can shard by putting the data into multiple tables (and on multiple machines too). So each table has a fraction of the data. This won't help with the data size, but it can speed up indexing time; inserting into a table with 10^5 rows is faster than inserting into a table with 10^10 rows.
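
        The manual sharding described here can be sketched as one table per date range (the names and ranges are hypothetical, not from the post):

        ```sql
        -- Hypothetical shard-per-month layout: each table holds a slice of
        -- the data, so each index stays a fraction of the full size.
        CREATE TABLE tablename_200801 LIKE tablename;
        CREATE TABLE tablename_200802 LIKE tablename;

        -- Each load then targets only the shard covering that file's dates.
        LOAD DATA LOCAL INFILE 'data_200801.csv'
          INTO TABLE tablename_200801
          FIELDS TERMINATED BY ',';
        ```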

        Depending on what you're going to be doing with this data, and the nature of the data (can you give the schema definition and a brief description of what the data is?), InnoDB might be better. I'm not 100% sure on this though.


        • #5
          CREATE TABLE tablename (
          date INTEGER NOT NULL,
          time INTEGER NOT NULL,
          xxxxx CHAR(1) NOT NULL,
          xxxxxx CHAR(16) NOT NULL,
          xxxxxx INT NOT NULL,
          xxxxxxxxx INTEGER NOT NULL,
          xxxxxxxxxx INT NOT NULL,
          xxxxxxxxxxx INTEGER NOT NULL,
          xxxxxxxxx CHAR(1) NOT NULL,
          xxxxxid CHAR(4) NOT NULL,
          xxxxxxxxxxxx CHAR(1) NOT NULL,
          xxxxxxxxxxxxx CHAR(1) NOT NULL,
          seq_num BIGINT NOT NULL,
          xxxxxxxxx TINYINT NOT NULL,
          xxxxxxxxxxxxxx TINYINT NOT NULL,
          xxxxxxxxxx CHAR(1) NOT NULL,
          src CHAR(1) NOT NULL,
          xxxxxxx INT NOT NULL
          );


          • #6
            You don't have a primary key. Is seq_num unique? If so, make it the primary key if you use InnoDB, and insert in primary key sequential order. What are the indexes you have?
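
            Purely illustrative (the next reply shows seq_num is not in fact unique), the InnoDB suggestion would look like:

            ```sql
            -- Illustrative only: with a clustered primary key, inserting in
            -- key order makes the load append-only instead of causing page
            -- splits all over the index.
            CREATE TABLE tablename_innodb (
              seq_num BIGINT NOT NULL,
              -- ...remaining columns as in the schema above...
              PRIMARY KEY (seq_num)
            ) ENGINE = InnoDB;
            ```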


            • #7
              I don't have a primary key; I found that it slowed down the data load. There are no unique fields, and seq_num is not really a sequence: roughly every 300 million entries it starts again from 1.

              I am currently using MyISAM.


              • #8
                The only thing I can suggest given the current information is to shard, or optimize your schema and indexes, eliminating as much as possible.

                Also, you should look into what MyISAM does when it exceeds the pointer size and/or data length. What version of MySQL are you using? Could you also show the indexes that you have on the table?
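
                On the pointer-size point: MyISAM sizes its data pointer from the table's row estimate, and older defaults capped the .MYD at 4 GB; a sketch, with placeholder numbers chosen for the 10x target:

                ```sql
                -- Raise MyISAM's row estimate so the data pointer is wide
                -- enough for the eventual table size (placeholder values).
                ALTER TABLE tablename
                  MAX_ROWS = 66000000000,
                  AVG_ROW_LENGTH = 64;

                -- Max_data_length in the output shows the new ceiling.
                SHOW TABLE STATUS LIKE 'tablename';
                ```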


                • #9
                  So, you need to import both the data and the indexes. You could create a temporary table without any indexes and the main table with indexes. You then import the rows into the temporary table, which may take a while. After that, you create a script that fetches some rows from the temporary table and puts them into the main table. You could run that script manually, but cron should do the trick. For example:

                  * * * * * /path/to/the/php /path/to/the/script

                  will run the importing script every minute.
                  This should be faster than playing with files and fixing indexes.
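
                  The staged copy the post describes can be sketched in SQL; the batch_id column is an assumed addition (the schema has no unique key, so the script needs some way to mark which staging rows it has already moved):

                  ```sql
                  -- batch_id is a hypothetical column added to the staging
                  -- table so each cron run moves one batch exactly once.
                  INSERT INTO main_table
                  SELECT date, seq_num /* ...and the remaining columns... */
                  FROM staging_table
                  WHERE batch_id = 42;

                  DELETE FROM staging_table
                  WHERE batch_id = 42;
                  ```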


                  • #10
                    How often do you plan to load 6 billion rows? If the answer is just once (outside of testing), then you shouldn't be choosing a storage engine or a sharding strategy based on load time. Presumably, once you get the data in the database, there's something you want to do with it. It's those access patterns that should be guiding your design decisions.

                    You're going to have more problems than just loading, given your data size and server specs. Have you considered analyzing your requirements and finding a way to load less data? Do you really need to be able to answer any possible question against your raw data? I've seen plenty of examples of people loading raw log-file data into the database because they believe they need ad-hoc querying, when practically they only run a few very simple aggregations or use the data to build summary tables. Sometimes it really is necessary, but it's always very expensive. If you have a set of well-defined questions you know you want to answer about your data, you should do your best to pre-process that data before it enters the database so that it's in an optimal form to answer those questions.
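
                    For example, if the real questions turn out to be daily per-source counts, a summary table built once at load time (this aggregate is hypothetical) replaces billions of raw rows with a few thousand:

                    ```sql
                    -- Hypothetical pre-aggregation: answers "rows per day per
                    -- src" without touching the 6.5B-row table again.
                    CREATE TABLE daily_summary (
                      date INTEGER NOT NULL,
                      src  CHAR(1) NOT NULL,
                      cnt  BIGINT  NOT NULL,
                      PRIMARY KEY (date, src)
                    );

                    INSERT INTO daily_summary
                    SELECT date, src, COUNT(*)
                    FROM tablename
                    GROUP BY date, src;
                    ```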


                    • #11
                      The data will grow to 66 billion entries over the next couple of months, so the loading will happen more than once.
                      Yes, we have looked at all the data, and it is such that we need all of it; partial data would be meaningless.


                      • #12
                        1. Why would that be faster?
                        2. Can you explain the import script?
                        3. Why would I need to cron it? Would this not be a one-time action?


                        • #13
                          OK, if you really need all the data, you should consider either using the MERGE storage engine or, if you have or can upgrade to 5.1, partitioning. Both are basically ways of splitting the table into multiple tables, with the advantage that MySQL figures out which ones to access when you read and write. As arya pointed out earlier, this approach reduces index size (more precisely, you get more, smaller indexes), but the real benefit for loading is that the different tables/partitions can be kept on different disks, which means you can parallelize the load. You'll need to split the raw data before loading it, but you'll make better use of your disk controller. Just be aware that both MERGE tables and partitioning have their limitations, so read the docs thoroughly before choosing one; the basic idea is the same either way.
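
                          A 5.1 partitioning sketch of the idea, using per-partition DATA/INDEX DIRECTORY to spread partitions across disks (column list abbreviated; the ranges and paths are placeholders):

                          ```sql
                          -- MySQL 5.1 range partitioning by date, one
                          -- partition per disk.
                          CREATE TABLE tablename_part (
                            date    INTEGER NOT NULL,
                            seq_num BIGINT  NOT NULL
                            -- ...remaining columns as in the schema above...
                          ) ENGINE = MyISAM
                          PARTITION BY RANGE (date) (
                            PARTITION p200801 VALUES LESS THAN (20080201)
                              DATA DIRECTORY = '/disk1/mysql'
                              INDEX DIRECTORY = '/disk1/mysql',
                            PARTITION p200802 VALUES LESS THAN (20080301)
                              DATA DIRECTORY = '/disk2/mysql'
                              INDEX DIRECTORY = '/disk2/mysql'
                          );
                          ```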

                          Also, if you're working with MyISAM, and you're read-only when you're not loading data, you can consider using compressed tables. This is something I've never tried, but the same technique is very popular among column-oriented databases, since trading CPU cycles for I/O bandwidth on large data sets like yours is usually a win.

                          If you're expecting high read/write concurrency once you have your data, then you might consider using InnoDB. Though you will see your data on disk double or triple in size, which definitely won't be much fun for you.


                          • #14
                            Thanks for all the responses.
                            I will investigate the options mentioned here and hopefully find the best solution.



                            • #15
                              Will this data be modified?
                              If not, consider the Archive engine.
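
                              A minimal sketch: ARCHIVE compresses rows on insert but supports only INSERT and SELECT (no UPDATE/DELETE, and in 5.0 no indexes), so it only fits write-once data:

                              ```sql
                              -- ARCHIVE trades indexes and updates for
                              -- heavy on-disk compression (abbreviated
                              -- columns; name is a placeholder).
                              CREATE TABLE tablename_arch (
                                date    INTEGER NOT NULL,
                                seq_num BIGINT  NOT NULL
                                -- ...remaining columns as above...
                              ) ENGINE = ARCHIVE;
                              ```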