Back to basics with a big table (65+ million records)

  • Back to basics with a big table (65+ million records)

    I have a large table (MyISAM with 65+ million records) built from the logs of a security product. The logs contain:
    Date, Time, IP address, Userid, Process name, File name, Event code, Sub code

    I then run detailed queries against this data for research. The machine is not networked, so there is only ever one user: me. However, it has become very slow for some of the queries I am running now. I have added a column with the MD5 hash of the process name so that I can query on process name faster.

    I want to start again; however, I cannot see a logical way to split the table or produce other tables. Any advice? This must have been done before with things like web log tables. For instance, the URL field is very similar to my process field - hence the MD5 hash approach. (Process field e.g. C:\Windows\System32\dhdfgg.dll )

    I have several fast machines available, and I wondered how I could split the table across several servers. Depending on the query, I am bound by CPU and/or reads. I was hoping I could split the work across several machines.
    Thanks very much

  • #2
    It's probably a little early to jump to multiple servers.

    I've got some questions:

    - How long is too long? A few seconds? 10+ minutes?
    - What queries are you doing and what does the EXPLAIN say?
    - What does SHOW TABLE STATUS say for this table?
    - What does SHOW CREATE TABLE say for this table?
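
    A quick way to gather those answers (assuming the table is named 'logs' - substitute your own table name):

    ```sql
    -- Hypothetical table name 'logs'
    EXPLAIN SELECT process, COUNT(userid) FROM logs GROUP BY process;  -- query plan
    SHOW TABLE STATUS LIKE 'logs';   -- row count, data/index size, row format
    SHOW CREATE TABLE logs;          -- exact column types and indexes
    ```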

    Regardless of those answers, I have some suggestions:

    First, I'd shrink any columns you can shrink. Example: an IPv4 address fits in a 32-bit int; if you're storing it as CHAR(15), you're wasting 11 bytes per row (700+ MB, in your case).
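
    MySQL's INET_ATON() and INET_NTOA() functions do that conversion for you. A sketch (the 'logs' table and 'ip_address' column are hypothetical names):

    ```sql
    -- Store the dotted-quad string as a 4-byte unsigned int
    ALTER TABLE logs ADD COLUMN ip_num INT UNSIGNED NOT NULL DEFAULT 0;
    UPDATE logs SET ip_num = INET_ATON(ip_address);

    -- Convert back to text only when displaying results
    SELECT INET_NTOA(ip_num) AS ip FROM logs LIMIT 5;
    ```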

    Instead of an MD5 hash on things like process name or file name, create a table for process names and another for file names, each with an auto-incrementing primary key. Then your table would become something like:

    Date, Time, IP address, Userid, Process_name_id, File_name_id, Event code, Sub code

    This makes aggregate operations much more efficient, but it will probably increase the cost of sorting by process name or file name.

    Your goal should be to have a fixed-width table - that means no NULLs, TEXT fields, VARCHAR fields, etc. Then you will have the fastest possible layout.
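
    Putting both ideas together, the layout might look like this (names are illustrative; the lookup tables stay small, so VARCHAR is fine there):

    ```sql
    CREATE TABLE process_names (
      id   INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
      name VARCHAR(255) NOT NULL,
      UNIQUE KEY (name)
    ) ENGINE=MyISAM;

    -- The big table then carries only fixed-width columns
    CREATE TABLE logs (
      log_date        DATE         NOT NULL,
      log_time        TIME         NOT NULL,
      ip_num          INT UNSIGNED NOT NULL,
      userid          INT UNSIGNED NOT NULL,
      process_name_id INT UNSIGNED NOT NULL,
      file_name_id    INT UNSIGNED NOT NULL,
      event_code      SMALLINT UNSIGNED NOT NULL,
      sub_code        SMALLINT UNSIGNED NOT NULL
    ) ENGINE=MyISAM;
    ```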

    I can give you more suggestions if you answer some of the questions above.


    • #3
      Thanks for the answers.

      The queries vary but a starting one might be:

      SELECT DISTINCT(process), COUNT(userid) FROM table GROUP BY 1

      Without an index it takes 24+ hours, but adding an index takes a long time as well.

      I like the idea of creating other tables for things like process name, but I am not sure how to do that, given the names are not necessarily unique. Would I need to maintain the 'process' table before I could populate the main log table? Or don't I care, because of the auto-incrementing primary key in those tables?

      So, add process names to the process table and then load the main table with a query that includes a select from the process table?



      • #4
        So yes, you'll want the process name, for example, to be unique in the processes table. If it's not, there's no advantage to substituting an int for the string. If you're working with a static data set, then populating these tables once is not very difficult. Otherwise, you are correct, you'll have to maintain these extra tables before any rows can be inserted.

        Also, your query struck me as strange. If you're trying to count the number of different users to use each process, the query would be:

        SELECT process, COUNT(DISTINCT user_id) FROM table GROUP BY process


        • #5
          You're quite right about that query! It was late and I did write down some rubbish. I was using the query you suggested.

          I am going to think about the idea of creating extra tables. Because I do these one-off queries I don't mind spending time creating extra tables for efficiency. I will try to leave the BIG data table without too much indexing etc and then create other tables.

          Even if I use a single huge table for 'raw' data and then create a new large table in the efficient way you suggest, this would still work, assuming I only update the data once a week:
          - update raw data into raw_data_table
          - create unique_host_table
          - create unique_process_table
          - create efficient_large_table, with host_id and process_id as per your first post
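
          The rebuild steps above could be sketched in SQL like this (column names are illustrative; assumes UNIQUE keys on the name columns):

          ```sql
          -- Build/refresh the lookup tables from the raw data
          INSERT IGNORE INTO unique_host_table (name)
            SELECT DISTINCT host FROM raw_data_table;
          INSERT IGNORE INTO unique_process_table (name)
            SELECT DISTINCT process FROM raw_data_table;

          -- Build the efficient table, substituting ids for the strings
          INSERT INTO efficient_large_table (log_date, userid, host_id, process_id)
            SELECT r.log_date, r.userid, h.id, p.id
            FROM raw_data_table r
            JOIN unique_host_table    h ON h.name = r.host
            JOIN unique_process_table p ON p.name = r.process;
          ```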

          It's a bit of work but should leave me with a workable table for further research/queries.

          Thanks for your help.


          • #6
            One last thing I figured I'd mention: sometimes the database isn't the best place to do the work. If you've got access to a unix system and your data in tab-delimited format, you can create a file of just the unique process names. For example, if it's column 4 of your file, you could do:

            cat mydata.txt | cut -f4 | sort | uniq > process_names.txt

            And then use LOAD DATA INFILE to get the data into MySQL.
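
            For example (the path and table name are illustrative; the file must be readable by the MySQL server, or use LOAD DATA LOCAL INFILE):

            ```sql
            LOAD DATA INFILE '/tmp/process_names.txt'
            INTO TABLE process_names (name);
            ```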

            Actually, you could do your whole query outside of MySQL in the same manner, using this crazy command, which assumes column 5 is user_id:

            cat mydata.txt | cut -f4,5 | sort | uniq | cut -f1 | uniq -c
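
            To see how that pipeline behaves, here is a self-contained run on a tiny tab-delimited sample (hypothetical data; columns 4 and 5 stand in for process name and user_id):

            ```shell
            # Build a 5-column tab-delimited sample: date, time, ip, process, user
            printf 'd1\tt1\t1.1.1.1\tcalc.exe\talice\n'  > mydata.txt
            printf 'd1\tt2\t1.1.1.2\tcalc.exe\talice\n' >> mydata.txt  # duplicate pair
            printf 'd1\tt3\t1.1.1.3\tcalc.exe\tbob\n'   >> mydata.txt
            printf 'd1\tt4\t1.1.1.4\tword.exe\talice\n' >> mydata.txt

            # Distinct (process, user) pairs, then count distinct users per process
            cut -f4,5 mydata.txt | sort | uniq | cut -f1 | uniq -c
            # prints a count of 2 for calc.exe and 1 for word.exe
            ```

            The duplicate calc.exe/alice pair is collapsed by the first 'uniq', so the counts are distinct users, not raw rows.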

            Your mileage may vary.