split huge tables or just query differently?

  • split huge tables or just query differently?

    This forum looks like a great resource - thanks for creating it.

    I have a challenge that no doubt has been faced by others before. I have developed a rather complex visitor/event/click tracking script that logs to multiple tables. This is not just simple stat tracking -- so I must normalize the database and have each table related by primary key IDs, in many-to-one and many-to-many relationships ... hence lots of JOINS.

    The problem is, the tables over time have gotten rather large: up to 1.4 GB (22 million rows) and increasing relatively quickly right now. When I try to view some stats, my connection with the server just gives up or takes forever and chews up a lot of CPU processing power. I have INDEXED everything I can. Buffers have been increased. I've used EXPLAIN and even queries that are fully indexed take too long or don't run at all if it involves a huge table. This is on a dedicated server with 2 GB of RAM and Intel Celeron 1.7GHz (not fast, I know).

    Summary tables are an option for some stat reports but not for others.

    So it appears that this is a "big table" problem. Although from what I've seen here, a 1GB/12 mil row table shouldn't be any problem for mysql? However, I can't just let the tables grow forever...

    I see two solutions, aside from getting better hardware:

    1) Split all the tables up by date, i.e., start logging to a new table when the current one reaches a certain size.

    or

    2) Better optimize my queries somehow.

    The problem with #1 (splitting tables), is that I've got primary key ID's starting in row #1 (date of first log) that may need to be referenced *today* by one or more other tables. (One reason why archiving isn't a solution). Assuming I can figure that out, there's the problem of SELECT having to span multiple tables if they contain data needed for the date range requested. UNION may work here but again, we're talking multiple split tables being UNIONed, joined and selected from.

    Regarding #2 (better queries), would it be better to try to get a subset of each huge table first? Like by date range ... and create some temp tables, then JOIN those? Or is that what mysql does anyway if the SELECT includes a WHERE clause on the date?

    Hope this makes some sense.
    Any ideas?
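
    To illustrate what I mean by #1, a report spanning two split tables would look something like this sketch (every table and column name here is invented for illustration, not my actual schema):

    ```sql
    -- Option #1 sketch: identical log tables split by period, recombined
    -- with UNION ALL when a report's date range crosses the split.
    SELECT v.id, e.event_type, e.logged_at
    FROM (
        SELECT * FROM events_2006_q3
        UNION ALL
        SELECT * FROM events_2006_q4
    ) AS e
    JOIN visitors AS v ON v.id = e.visitor_id
    WHERE e.logged_at BETWEEN '2006-09-15' AND '2006-11-08';
    ```

    Multiply that by several split tables, all JOINed to other tables, and you can see why I'm hesitant.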

  • #2
    Hi,

    bluem wrote on Wed, 08 November 2006 09:53

    I have INDEXED everything I can.



    Maybe this is the problem? If the queries are recurrent and there is a small number of different conditions in WHERE clauses or JOINs, you should be ok with 2-3 indexes. Have you checked the state of queries that are running? My guess is that inserts or updates are locking the tables for a relatively long time (due to a huge number of indexes needing to be updated).

    Moreover, I think you should give InnoDB a try, as it does row-level locking and multiple inserts/updates/selects can go on simultaneously. However, you should really reconsider the schema of the database. If I were you, I would do archiving anyhow and use a small database with as few indexes as possible for run-time inserts and updates, and a big one for the selects. Of course, decisions depend on the actual needs of the application.
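
    To sketch what I mean (the table names are just examples, not from your application): keep a lean table with no indexes for run-time inserts, and periodically flush it into the big indexed table used for selects:

    ```sql
    -- Periodic flush from the small insert table into the big select table;
    -- LOCK TABLES keeps the copy-then-delete step consistent on MyISAM.
    LOCK TABLES clicks_live WRITE, clicks_archive WRITE;
    INSERT INTO clicks_archive SELECT * FROM clicks_live;
    DELETE FROM clicks_live;
    UNLOCK TABLES;
    ```

    Inserts then only ever touch the small table, and your heavy reporting selects only ever touch the big one.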



    • #3
      Thank you for the reply.

      inner wrote on Wed, 08 November 2006 01:28

      If the queries are recurrent and there is a small number of different conditions in WHERE clauses or JOINs, you should be ok with 2-3 indexes. Have you checked the state of queries that are running? My guess is that inserts or updates are locking the tables for a relatively long time (due to a huge number of indexes needing to be updated).



      When I say I've indexed everything, I mean on the fields that are being used in WHERE clauses ... not really 'everything'.

      You could very well be right about locking -- mysql is very busy inserting rows constantly, every pageview, etc. Maybe 50,000 new entries a day, I don't know for sure.

      I'm not sure how to check the state of queries that are running (forgive my ignorance). How is this done?

      On that note, how DOES mysql do inserts/updates to a table that is at the same time being SELECTed from? Is it one or the other, or can they be done at the same time?

      I like your idea of "use small database with as few indexes as possible for run-time inserts and updates, and a big one for the selects". I might just have to go that route somehow. (Did you mean separate tables, or did you actually mean separate databases?)

      Thanks again.



      • #4
        Good evening,

        bluem wrote on Wed, 08 November 2006 18:58

        When I say I've indexed everything, I mean on the fields that are being used in WHERE clauses ... not really 'everything'.


        Yes, I got the idea. My point was that sometimes very rare selects are not even worth indexes (especially huge indexes over multiple columns), as not having an index could save you some memory and disk writes. Updating indexes, as well as inserting data, can be very expensive (in an IO-throughput sense) if your disk drives or RAID controller have a small (or no) write cache.

        bluem wrote

        You could very well be right about locking -- mysql is very busy inserting rows constantly, every pageview, etc. Maybe 50,000 new entries a day, I don't know for sure.

        I'm not sure how to check the state of queries that are running (forgive my ignorance). How is this done?


        Actually it's as simple as "show [full] processlist" at the mysql command line. To get more involved in mysql performance monitoring, I can recommend trying mytop (by Jeremy Zawodny) and/or innotop (by Baron Schwartz) - these are really great tools! (If you're using mytop with mysql 5.0, don't forget to change "show status" to "show global status" in the mytop perl script to get correct operation =)

        bluem

        On that note, how DOES mysql do inserts/updates to a table that is at the same time being SELECTed from? Is it one or the other, or can they be done at the same time?


        It mostly depends on the chosen storage engine. You can find a bit more about the innodb locking model (which is probably the most powerful one) here: http://dev.mysql.com/doc/refman/5.0/en/innodb-transaction-model.html

        bluem

        I like your idea of "use small database with as few indexes as possible for run-time inserts and updates, and a big one for the selects". I might just have to go that route somehow. (Did you mean separate tables, or did you actually mean separate databases?)


        I don't think this is very important in mysql (from the database schema point of view). I was referring to the word "database" because you were talking about relations between multiple tables (although you are only considering splitting one so far). However, it's not very important whether it's one small table and a big one, or a bunch of small tables for inserts vs. big ones for selects, as long as you have enough of them set in "table_cache".

        Regards,
        Aurimas



        • #5
          Thanks for the tips! I will check out mytop and innotop.

          "show processlist" is very interesting!

          I guess I don't have a huge number of queries (today anyway) or they are fast -- because I had to issue the command many times to see one of the tracking insert/updates. (Is there a way to automatically refresh or reissue the command? maybe in the likes of mytop...)

          When doing a large query, State says "Copying to tmp table" for whatever time it takes, like 32 (seconds I assume?) for one of them.

          However, at one point during a large select query, two of my normal tracking queries were attempted, and for those State was "Locked" for 3 secs and 1 sec respectively (they wanted the same table as the select query that was running).

          One thing: my SHOW STATUS report shows 3,260 Table_locks_waited over an Uptime of 6 days. I understand this is not a good thing.

          Quote:

          However, it's not very important if it's one small table and a big one or if it's a bunch of small tables for inserts vs. big ones for selects as long as you have enough of them set in a "table_cache"


          If you have time, could you expound on that statement? I don't know what my table_cache setting is. Is it in show status somewhere? If it affects the size of tables I can access efficiently, then obviously it is something I need to know before breaking them up into smaller tables.

          One last thing for clarification:
          Quote:

          I don't think this is very important in mysql


          Are you referring to the strategy of separating the tables into runtime insert tables (smaller, no indexes), and 'select' tables (larger, indexed)?

          Thanks,
          David



          • #6
            Good day!

            bluem wrote on Wed, 08 November 2006 22:46

            I guess I don't have a huge number of queries (today anyway) or they are fast -- because I had to issue the command many times to see one of the tracking insert/updates. (Is there a way to automatically refresh or reissue the command? maybe in the likes of mytop...)


            Yes, mytop will help you there. It's somewhat similar to the UNIX top command; you can set the refresh rate there.

            bluem

            When doing a large query, State says "Copying to tmp table" for whatever time it takes, like 32 (seconds I assume?) for one of them.


            Take a look at this post: http://forum.mysqlperformanceblog.com/s/t/70/
            If your query uses a tmp table, make sure it creates it in memory, or even better - if it's possible, try to redesign your query (and probably part of the application) to avoid using a temporary table.

            bluem

            However, at one point during a large select query, two of my normal tracking queries attempted, and for those State was "Locked" for 3 secs and 1 sec respectively (they wanted the same table that the select query that was running).


            I suppose on higher loads you will see more of them. In hyper-active applications table locks are really bad.

            bluem

            One thing: my SHOW STATUS report shows 3,260 Table_locks_waited over an Uptime of 6 days. I understand this is not a good thing.


            Please, compare it to "Table_locks_immediate". Maybe the rate is not very high?

            Quote:

            However, it's not very important if it's one small table and a big one or if it's a bunch of small tables for inserts vs. big ones for selects as long as you have enough of them set in a "table_cache"


            bluem

            I don't know what my table_cache setting is. Is it in show status somewhere? If it affects the size of tables I can access efficiently, then obviously it is something I need to know before breaking them up into smaller tables.


            It's not about the table size, but about the number of tables opened from the file system. You can see the value by issuing "select @@table_cache;", and the efficiency by looking at the status of "Opened_tables". If this is high, then check whether "Open_tables" is equal to "table_cache"; if so, you will probably have to increase table_cache in your my.cnf.

            bluem

            One last thing for clarification:
            Are you referring to the strategy of separating the tables into runtime insert tables (smaller, no indexes), and 'select' tables (larger, indexed)?


            Touché!

            Have a good day!



            • #7
              Many thanks inner, again.

              Regarding this thread: forum.mysqlperformanceblog.com/s/t/70/ --
              its conclusions weren't exactly clear ... but I believe it's saying that mysql will always put the tmp table on disk if the query involves tables with VARCHAR, TEXT or BLOB type fields, and hence be slower. Hope that conclusion is correct. So, yeah, I see no way around that as far as writing the queries. I have no idea if my server uses tmpfs (clueless, once again).

              My status report shows:
              Table_locks_immediate: 1,530,068
              Table_locks_waited: 3,260
              Created_tmp_tables: 4615
              Created_tmp_disk_tables: 149
              Open_tables: 1,118
              Opened_tables: 1,455

              table_cache is: 32,512

              I think these are good ratios but I don't know what is normal or good.

              inner, you have been very generous with your time, so I hate to ask one more time for clarification, but I'm confused. In your first post you said:

              "However, you should really reconsider the schema of the database ... use small database with as few indexes as possible for run-time inserts and updates, and a big one for the selects."

              But in your recent post you verified that "I don't think this is very important in mysql".

              So is the idea of a small, non-indexed, insert table beneficial to performance or not? (again, sorry for my dull-headedness)



              • #8
                No problem, bluem, just hope it helps.

                Regarding temporary tables - as far as I understand, the essential point is:
                - if the created temporary table's properties allow the HEAP storage engine (i.e. there are no BLOB or TEXT columns), the table is small enough to fit in tmp_table_size and max_heap_table_size, and the table is not "with a large row length", then it's created as a HEAP table and, of course, stored in memory. Otherwise, it is stored as a MyISAM table in your "tmp" directory.
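
                You can check both size limits directly (both variables exist on 4.1/5.0-era servers):

                ```sql
                -- A tmp table larger than the smaller of these two goes to disk
                SELECT @@tmp_table_size, @@max_heap_table_size;
                ```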

                If you didn't do anything to put your tmp to tmpfs, then most probably you are not using tmpfs for your scratch files. A very small article about tmpfs on Linux: http://www.cyberciti.biz/tips/what-is-devshm-and-its-practical-usage.html

                bluem wrote on Sat, 11 November 2006 23:49

                My status report shows:
                Table_locks_immediate: 1,530,068
                Table_locks_waited: 3,260
                Created_tmp_tables: 4615
                Created_tmp_disk_tables: 149
                Open_tables: 1,118
                Opened_tables: 1,455
                table_cache is: 32,512


                These are not very bad values, therefore I don't see your bottleneck here, unless these 3260 queries that have been waiting for other threads to unlock the tables were in a queue for a really long time. Maybe you should check your slow query log? Not only will it show the queries that have been waiting for the tables to unlock, you should also see the bad ones there that kept others waiting.
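
                Enabling the slow query log is a couple of lines in my.cnf (the path and threshold below are only placeholders; adjust to your server):

                ```
                [mysqld]
                log-slow-queries = /var/log/mysql/mysql-slow.log
                long_query_time  = 2
                log-queries-not-using-indexes
                ```

                Then restart mysqld; the log itself is a plain text file you can read directly.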

                BTW, to bypass the most common mistakes in applications, queries and server configuration, please check the latest presentation by Peter on "MySQL performance optimization" ( http://www.mysqlperformanceblog.com/files/presentations/OSDBCON2006-MySQL-Performance-Optimization.pdf ), it's really great. You'll spend up to an hour reading it, but it might save you weeks of googling, writing and reading, as it's like an all-in-one article.

                bluem

                inner, you have been very generous with your time, so I hate to ask one more time for clarification, but I'm confused. In your first post you said:

                "However, you should really reconsider the schema of the database ... use small database with as few indexes as possible for run-time inserts and updates, and a big one for the selects."

                But in your recent post you verified that "I don't think this is very important in mysql".


                That "I don't think this is very important in mysql" was more related to your question "separate databases, or separate tables?", as it's up to you how many tables you use for data gathering and so on; don't invest too much in that proposition.

                bluem

                So is the idea of a small, non-indexed, insert table beneficial to performance or not?


                IMO, yes. Reasons:
                1. a small table is fast anyhow;
                2. not using indexes for inserts reduces the number of I/O operations;
                3. as you're collecting statistics, inserts are probably the most frequent operation (updates would need some indexes, however);
                4. inserts will not compete with selects.



                • #9
                  Awesome, you've clarified everything. Thanks for the links. I will definitely study Peter's presentation.

                  I don't have slow query logging enabled (don't think so anyway) - if it was logging, is there a standard way of pulling up the log?

                  FYI, status says:
                  Slow_launch_threads is 0
                  Slow_queries is 67



                  • #10
                    Hi,

                    Your problem is typical: over time, tables become large, which first makes them consume more CPU due to large index ranges etc., and then they become disk-bound, which causes a dramatic slowdown.

                    In your case, think about what the queries you're executing actually need to do. If a query needs to fetch 10,000 rows from disk that are all in random locations and nothing is likely cached, the query will be slow, and you will need to find a way to do the same thing some other way.

                    More detailed answer would need more information.



                    • #11
                      I've gotten some good ideas as to what to try now -- appreciate the response. I read over your mysql presentation (link from "inner") and it's very valuable. My only wish is that it had even more detail to fill in between the main bullet points, for newbies like me :) Nevertheless, I printed it out and have gleaned some great tips from it. Thanks!



                      • #12
                        Hi bluem,
                        Sorry to butt in here, it sounds like you have got it pretty much sorted.

                        Have you considered splitting the main data tables by time period, say 3-month chunks, into identically structured tables with names based on the period - e.g. q1_data, q2_data, etc.

                        That way your analysis application could be clever about which sections of data it analyses, and could build union queries or use MERGE-type tables to combine those periods into more usable chunks of time - years, etc.
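
                        As a rough sketch of the MERGE approach (MyISAM only; all the names here are invented):

                        ```sql
                        CREATE TABLE q1_data (visitor_id INT NOT NULL, hit_time DATETIME, KEY (hit_time)) ENGINE=MyISAM;
                        CREATE TABLE q2_data LIKE q1_data;
                        -- The MERGE table presents the quarterly chunks as one logical table
                        CREATE TABLE year_data (visitor_id INT NOT NULL, hit_time DATETIME, KEY (hit_time))
                            ENGINE=MERGE UNION=(q1_data, q2_data) INSERT_METHOD=LAST;
                        SELECT COUNT(*) FROM year_data WHERE hit_time >= '2006-01-01';
                        ```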

                        This way you can split the analysis data onto multiple disks if available - I have found that most of the bottlenecks we are experiencing are disk-based rather than anything else - our 'database' comprises 400+ databases, each with 100+ tables - over 800 GB total. Mainly select queries, with multiple joins which in most cases cross databases (I still find it amazing how mySQL handles all this without breaking into a sweat!)

                        I also agree with inner's first post: a small table used to collect the new data, which is flushed into the analysis tables say once a day, would certainly stop the problems you seem to be experiencing with locking. I know nothing about InnoDB, so can't comment on how that would handle things better / differently.

                        Hope that helps in some way

                        Has.



                        • #13
                          Thanks for the input, Hasgaroth.

                          What you suggested is pretty much what I had decided upon, so it's good to see an affirmation of the idea. Somehow the data has to be split by date. I was thinking that since traffic levels can vary dramatically over time, it may be better to split tables up by size (# of rows) and have a "master" table that records the start and end dates for each split table, so queries would know which tables to join.
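
                          Something like this is what I have in mind for the master table (the names are just a sketch):

                          ```sql
                          CREATE TABLE log_chunks (
                              table_name VARCHAR(64) NOT NULL PRIMARY KEY,
                              start_date DATE NOT NULL,
                              end_date   DATE NOT NULL
                          );
                          -- Which chunk tables cover the requested report range?
                          SELECT table_name FROM log_chunks
                          WHERE start_date <= '2006-11-30' AND end_date >= '2006-11-01';
                          ```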

                          Anyway, I'm having to rewrite my entire application. But that's OK. It'll be a lot better for it.

                          A question: how do you do joins across multiple databases? I thought a query in PHP, for instance, always had reference to one database. I'd rather have a separate db for my traffic/stats but on occasion, it needs to be able to join with other tables, such as customer orders.



                          • #14
                            You "connect" to one database and it becomes the default, but you can access tables from another database, e.g.:

                            use dba;

                            select * from a join dbb.b on a.id=b.id;

                            This will join two tables in two databases :)



                            • #15
                              Wow, what a great thread. This has just helped me get my head round a number of problems we are experiencing. Thankyou everyone who contributed.

