One query on a big table vs multiple queries on multiple smaller ones


  • One query on a big table vs multiple queries on multiple smaller ones

    Hi!

    I'm glad I found this website. The fact that Peter decided to start a forum makes me twice as glad ) The question that I have is obviously related to MySQL performance, as the website name implies.

    I need to estimate which one of the following approaches provides better performance.

    I'm implementing a simple mapping table, which operates in terms of numerical object IDs (OID) and up to "N" numerical aliases (Alias1, ..., AliasN) of different alias types (AliasType1, ..., AliasTypeN). One OID can have zero or one Alias of the given AliasType associated with it. So, a pair of OID and AliasType uniquely identifies an Alias (but it may not exist at the time of the query). Also, a pair of an Alias and AliasType uniquely identifies an OID.

    Since it is not very likely that any given OID has Aliases of all AliasTypes defined, I ruled out the most obvious implementation: a MySQL table with the columns (OID, Alias1, Alias2, ..., AliasN), where every column would be a unique key. In this case my table could get rather sparse, and with tens of millions of records the efficiency would be rather pathetic.

    Another approach suggests creating a single three-column table, (OID, AliasType, Alias), which can have two indices: (OID, AliasType) and (Alias, AliasType). In this case I shrink the width of the table but multiply the row count.

    And the last approach I thought of is a derivative of the above. I can split this long 3-column table into N 2-column tables, where every table would contain mappings of an OID to an Alias of a particular AliasType.
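
    For concreteness, the two alternatives might look roughly like this (names and column types are just placeholders for illustration):

    CREATE TABLE OidAlias (
        Oid       BIGINT UNSIGNED NOT NULL,
        AliasType TINYINT UNSIGNED NOT NULL,
        Alias     BIGINT UNSIGNED NOT NULL,
        UNIQUE KEY (Oid, AliasType),
        UNIQUE KEY (Alias, AliasType)
    );

    -- or one table per AliasType, e.g. for AliasType1:
    CREATE TABLE OidAlias1 (
        Oid    BIGINT UNSIGNED NOT NULL,
        Alias1 BIGINT UNSIGNED NOT NULL,
        PRIMARY KEY (Oid),
        UNIQUE KEY (Alias1)
    );
    -- ...and so on up to OidAliasN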

    Please correct me if I'm wrong in my expectations.

    1. I think that a lookup for a particular Alias or an OID, given the AliasType and either OID or Alias, should be fast in the last design approach. This conclusion is based on the fact that MySQL server would have to make a lookup on a relatively small data set with the luxury of having a unique index built for both types of queries.

    2. If I need to get all the aliases of all types that correspond to a given OID, the earlier design (with a 3-column table) should provide significantly better performance. I think this is true because I would have to fire N queries, one against every 2-column table, versus one query against the 3-column table that returns a multi-row result set (see the sketch below).
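
    For example, for point 2 the contrast would be roughly the following (illustrative queries only, using the sketch tables above):

    -- 3-column table: one query returning up to N rows
    SELECT AliasType, Alias FROM OidAlias WHERE Oid = 12345;

    -- table per AliasType: N separate queries
    SELECT Alias1 FROM OidAlias1 WHERE Oid = 12345;
    SELECT Alias2 FROM OidAlias2 WHERE Oid = 12345;
    -- ...and so on up to OidAliasN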

    Am I correct in the statements above?
    Maybe there is a better solution for MySQL than what I'm thinking about?


    Thanks a lot!
    /Sergey

  • #2
    Hi Sergey,

    Just some quick thoughts/questions that might help )
    (I'd be very interested to hear Peter's thoughts on this too; it's always good to get a better understanding of these things)


    First, some questions to consider that will help determine which approach suits you best:
    * Do you expect the number of OIDs/aliases per aliastype to be distributed evenly across aliastypes?
    * Will data for some aliastypes be fairly static or will the data for all aliastypes be changing a lot?
    * How many different aliastypes do you expect to have? And is this number likely to change much?
    * What is your ratio of per-aliastype queries vs aliastype-independent queries?


    I think if you have very different update/access patterns per aliastype then it _might_ make sense to split entries into separate tables per aliastype. Even then, I'd probably only consider it if you need very very high query throughput, and if queries to find aliases from OIDs are very infrequent and queries to find OIDs for a specific alias and aliastype are very frequent. (because, as you say, you'd need to do lots of separate queries or a large union/join if there's a table per aliastype and you need to do an aliastype-independent query)

    The hassle of adding new tables for new aliastypes and modifying the queries just doesn't seem worth it, even if you automate it somehow. One advantage of splitting them up though, if you have very different access patterns per aliastype, is that you might be able to get more query cache hits on the tables whose data changes less.


    Personally I'd start by benchmarking your first alternative approach (a single three-column table (OID, AliasType, Alias) with two indices, (OID, AliasType) and (Alias, AliasType)) to see whether it meets your performance needs or not. If it does (and I think it should unless you're dealing with very high throughput and very large data volumes) then it seems better to me because it offers:
    * A pretty standard approach to solving the problem.
    * Flexibility for different select queries
    * Flexibility/simplicity when running updates, complex deletes etc.
    * Less impact when adding aliastypes.

    OTOH, if you're nearly always doing queries per-aliastype and you need very high throughput then maybe your second alternative is better.

    Can you benchmark them and post your findings? I'd be interested to know.

    HTH,
    Toasty



    • #3
      Hi Toasty,

      Thanks for the elaborate reply and the serious questions raised. Here are some answers and additional details.

      * Let's say there are only 12 AliasTypes. The typical query would provide an Alias of one of 10 AliasTypes and request an Alias of one of the remaining two AliasTypes as the output. For example, given some Alias9 I have to provide the value of Alias1 via an intermediary lookup of the OID. This task can be accomplished in the following steps (see the query sketch after this list):
      1. Using the table that maps aliases of type 9 to OIDs, I retrieve the OID that corresponds to the Alias9 value.
      2. I use this OID to retrieve the desired Alias1 from the corresponding table.

      So, roughly, two tables will experience five times the retrieval load of the remaining ten.

      * The number of AliasTypes is known and limited to approximately 40 types. The overhead of adding more tables as new AliasTypes are discovered is negligible, as it is a very rare event and the software can be shut down for maintenance of that sort.

      * Some tables are expected to have fairly static data, but I cannot really make any quantitative predictions yet.

      * The number of queries per second can be pretty high, on the order of 3000 retrieval queries per second. But it is hard to baseline this number, as it depends on the hardware configuration and the degree of its utilization by other applications.

      * There are no AliasType-independent queries, as any retrieval is qualified by an AliasType; otherwise, the lookup is ambiguous.
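
      To make the typical lookup above concrete, under the split-table design the two steps would look roughly like this (table and column names are just for illustration):

      -- step 1: Alias9 -> OID, using the table for AliasType 9
      SELECT Oid FROM OidAlias9 WHERE Alias9 = ?;
      -- step 2: OID -> Alias1, using the table for AliasType 1
      SELECT Alias1 FROM OidAlias1 WHERE Oid = ?;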

      So, these are the additional important details that you pointed out. Hope it helps to define the problem better.

      Or, let me take a step back and describe the problem in its original form. Maybe it could bring up some additional ideas on how to implement that in a better way than what I have described earlier.

      So, I need to store information about up to 50 million objects with a predefined set of attributes. Each object should have a unique ID (I referred to it as OID above), a predetermined set of alternative IDs (Aliases in the text above), and the rest of the attributes are just properties of the objects that can be looked up but can never be used as retrieval keys. I receive information in portions, with one or more Aliases and zero or more Properties. After I receive each portion of information, I have to determine if it belongs to any existing object and either fill in the empty fields or create a new record with a unique OID. Later, I can be queried for any of the Aliases or Properties given one of the Aliases.

      Here is an example. Let's say, I'm building a list of workstations in the network. Every workstation is uniquely identified by an IP address, MAC address of the NIC, and host name. All these attributes are aliases. Every workstation can also have its brand, model, and OS name as Properties.

      The straightforward solution would be to create a single table like this:

      CREATE TABLE Workstations (
          Oid BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
          IpAddr INT UNSIGNED, MacAddr BIGINT UNSIGNED, HostName VARCHAR(32),
          Brand VARCHAR(32), Model VARCHAR(32), OsName VARCHAR(32),
          PRIMARY KEY (Oid), KEY (IpAddr), KEY (MacAddr), KEY (HostName)
      );

      If I receive a message that has only one Alias (IpAddr, for instance), I check if there is a record with this Alias. If such a record exists, I don't do anything. Otherwise, I create a record that has only two non-NULL fields: Oid and IpAddr.
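
      As a rough sketch against the table above (192.168.0.1 is just an example value):

      -- does any record already have this IpAddr?
      SELECT Oid FROM Workstations WHERE IpAddr = 3232235521;  -- 192.168.0.1
      -- no row came back, so create a record with only Oid and IpAddr set
      INSERT INTO Workstations (IpAddr) VALUES (3232235521);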

      If I receive a message that has more than one Alias in it, I can safely assume that all these Aliases correspond to the same object. Thus, I can either create a new record if none of the Aliases exist in the table, or do nothing if all of them return the same record, or merge two or more records if lookups for all these Aliases one by one returned different records.

      A similar approach can be used with Properties.

      Such an implementation seems inefficient if I end up with a sparse table, where only a couple of fields are non-NULL. I don't care as much about the amount of disk space, but I do care about the response time. That's why I started to think about the alternative solutions that I described in the original message. Am I going in the right direction with it? Or can MySQL handle such sparse tables efficiently?

      What if my table had about 15 aliases of 4 to 8 bytes each and about 400 bytes' worth of properties combined? How would MySQL handle such a table if it had about 50 million sparse records of that sort, where only about a third of the aliases and properties would be known? Would it be any better than one of those more complex schemas shown in the original message?

      Thanks for reading! )
      It would be super nice if anybody would express an opinion on that.
      Thanks again!

      /Sergey



      • #4
        Hi Sergey,

        Just a quick question to help me while I think about this. I'd just like to make sure that I've understood how you do your OID merging/collapsing.

        Would it be possible for example, that through partial input information you could have the same real-world object represented by say 10 separate OIDs, each with 4 different aliases defined? Would you then have to try to merge some or all of these OIDs if you got an insert/update request that had a new combination of these aliases provided together?

        eg, say you start with:
        OID1 only has values for alias1->alias4 = (a,b,c,d)
        OID2 only has values for alias5->alias8 = (e,f,g,h)
        ...
        OID10 only has values for alias37->alias40 = (w,x,y,z)

        You then get an insert request that has (alias38=x, alias3=c), so you'd:
        Lookup alias38 to see if 'x' exists. ('x' is assumed to be globally unique for alias38 I presume?)
        Lookup alias3 to see if 'c' exists. ('c' is assumed to be globally unique for alias3 I presume? etc)

        You notice that they both exist, so you create a new OID:
        OID11 has values alias1->alias4 = (a,b,c,d) and alias37->alias40 = (w,x,y,z)
        AND delete OID 1 and OID 10

        I take it that as part of creating OID11 from merged values of OID1 and OID10 you'd never get conflicts of property (non-alias) values?
        I also take it that every alias value is unique so you could never get alias value clashes on OID merge?

        Cheers,
        Toasty



        • #5
          Hi Toasty,

          You are right on target. Here are a few more details that pertain to that "merge/collapse" scenario:

          1. It is not yet determined whether a merge creates a new record with a new OID, or whether it reuses one of the existing records, just filling in some of its fields and dropping the other record.

          2. The above scenario becomes more funky if I get a contribution that has three or more aliases (e.g., Alias3, Alias20, and Alias37). In that case I have to merge three records, and I either have to determine which one of the three OIDs to use or, as you suggested, create a completely new row with a new OID. I like the latter approach better for its generality, but I need to think about whether it is OK for the purpose of the application.

          3. The merge scenario may also hit a hard-to-handle condition where the two rows being merged have overlapping aliases defined and the values of those overlapping aliases conflict. This may very well be a synthetic scenario, and I need to think more to say whether it is realistic at all.


          /Sergey



          • #6
            One more question Sergey,

            What are the 'external' operations that this database will be required to support (irrespective of how it's designed)?

            Will it mostly be used for inserting/merging/updating data about OIDs or will there also be a heavy select load in parallel?

            From what you've said so far I'm not 100% sure whether the selects which will do lookups by alias are generated from the need to merge OIDs (i.e. to see if any matching data exists for a particular alias) or from genuine 'external' requests.

            Tar,
            Toasty



            • #7
              Thanks for your ideas Toasty! Here are more details on the things you were wondering about.

              When the database is empty, the majority of operations are going to be merges, insertions, and deletions (updates are covered by merges). As it collects more details about the objects, the balance will shift towards mostly select operations. In the limit, when we know all the properties of all the objects, we would still observe some amount of merges and other data-altering operations going on, since the nature of aliases is such that some of them are updated periodically and my storage has to learn these new mappings (for instance, in the example above, the IP address could have been reassigned by a DHCP server). For simplicity, we can assume that the old aliases are expired once the new ones arrive.


              Cheers,
              /Sergey



              • #8
                OK I've had a bit more of a think about this and I think I'm changing my mind a bit )

                I think the main thing to aim for is minimising the size of the data and indexes here as 50M objects with 40 unique properties (aliases) plus a bunch more non-searchable properties results in a fairly big database. Combine that with a 3K/sec query rate and you definitely need to be careful. (and have a lot of RAM )



                If you go with the original "one big table with a column per alias and a column per property" approach, then you're going to end up with the problem of managing a >50M row, 41 col table with 41 unique indexes on it (one per alias, and one for the OID)

                This will have really massive RAM requirements (you'll want the indexes in RAM if you're doing 3K queries/sec unless you're willing to scale out/up your DB platform or partition your data) and is generally not a great approach IMO. (It's possible that your 'hot' data is a very small subset of the data stored in your DB so you might get away with this, but in your case it doesn't really sound like it)


                Your alternative approach where you have a "single three-column table: (OID, AliasType, Alias)" falls down a bit I think, because I presume that each alias could be of a different type? (e.g. what would you define the 'Alias' column to be in this approach? A varchar(255) just in case? ;] )
                Otherwise it's OK I guess, other than the fact that for 50M objects you'd have up to 50M * 40 rows (one for each alias of each object), so your table could have up to 2,000,000,000 rows (more likely 750M?). Not fun to manage, I imagine.


                If you take your other approach - splitting alias values into separate tables by aliastype - then you'll still have say 750M rows, just split across 40 tables. This makes the data more manageable from an admin POV, but the number of select queries you're going to have to do increases roughly by a factor of N, where N is the average number of aliases given in any request you receive. In addition, upon an OID merge you're going to need to run inserts and deletes per aliastype table, which means that you probably should use explicit transactions, and also that you're going to have to run more, but smaller, DML statements (not usually a problem in itself).
                All of these statements will basically be unique key lookups though, so will be super fast.
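
                For the merge example I gave in #4 (OID1 + OID10 -> OID11), it would be something along these lines per affected table - just a sketch, assuming InnoDB and one (Alias, Oid) table per aliastype with made-up names:

                START TRANSACTION;
                -- repoint OID1's alias3 value and OID10's alias38 value at the new OID11
                DELETE FROM alias3 WHERE Alias = 'c';
                INSERT INTO alias3 (Alias, Oid) VALUES ('c', 11);
                DELETE FROM alias38 WHERE Alias = 'x';
                INSERT INTO alias38 (Alias, Oid) VALUES ('x', 11);
                -- ...same for the other aliases of OID1 and OID10, plus the main OID/properties table
                COMMIT;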

                So, although I have a niggly feeling that there's some vastly better way to do this, it seems that the approach with one table for the OID and non-alias properties, plus another 40 tables keyed on alias value to look up the OID, is the way to go. You're still going to have to do some serious micromanagement (be careful with your column definitions / tune your server and MySQL config / etc) and have some fairly meaty hardware to get the performance you want I think.

                Another advantage of the table per aliastype approach is that if you really need to - and if you're not bothered about transactional changes to the data (a big if!) - you could put the different tables on different servers and have your app direct queries appropriately.

                Also, especially if you're expecting a very high number of requests that change the data, and you're running multiple separate statements, you're going to have to be careful that objects haven't already been merged by another process in the middle of your current attempt to merge them, etc. I think this could happen with any DB design though, if requests that can merge data can be made in parallel.


                As a disclaimer on all of the above - I've never worked on something exactly like your problem before, although I have done some similarish things. My approach each time is to set up a prototype, benchmark, observe, tweak, re-benchmark, etc. Eventually I'm either happy, or if not I try something different, or, if I can't think of a better approach, I look into scaling out the DB platform or upgrading the DB server, depending on how things look.

                HTH anyway!
                Toasty

                PS If anyone else would like to join in with some thoughts on this I'd be interested to hear them



                • #9
                  Toasty,

                  Please let me thank you once again for your constructive and elaborate feedback!

                  I have to admit that the numbers I mentioned have been elevated to account for future extension of the data set. One of the goals we are trying to achieve is the scalability of the solution. In this light, the solution with a number of primitive mapping tables seems to have much better "horizontal partitioning" possibilities as compared to having a single alias table, for example.

                  The other side of the story that you emphasize in your reply is the hardware limitations. The application I'm working on has to run on relatively weak hardware (2GB of RAM, one Xeon CPU, a SATA hard drive, Linux). Given that, the major design consideration is being able to distribute data across a little "farm" of these weak computers.

                  Ideally, I wish the MyISAM API were documented and available for direct use, bypassing the SQL engine part...


                  Cheers,
                  /Sergey



                  • #10
                    >Ideally, I wish the MyISAM API were documented and available for direct use, bypassing the SQL engine part...

                    Hi again Sergey, it's funny you say that because just this morning I was thinking you might well be better off just doing this with hashtables in your favourite programming language.


                    I've never looked at using the low-level MyISAM APIs - in fact, the only reference I've seen to them is at the bottom of the following page, and it's hardly a lot of use! http://dev.mysql.com/doc/refman/5.0/en/tips.html

                    I have experimented with the HANDLER commands a little though - http://dev.mysql.com/doc/refman/5.0/en/handler.html - but they've never really been appropriate for my needs (yet) and might not really help you much either, I don't know.
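
                    For reference, a HANDLER-based lookup against your Workstations table would look something like this (untested sketch):

                    HANDLER Workstations OPEN;
                    HANDLER Workstations READ IpAddr = (3232235521);  -- point lookup via the IpAddr index
                    HANDLER Workstations CLOSE;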

                    TBH you might find that you're just fine with MyISAM via the normal SQL interface. It's really going to depend on how large your 'hot' data set (working set) is. Even if your DB is 50GB on disk, if only 512MB of it is accessed frequently you'll probably see really good performance as it'll fit in RAM.

                    In reality for your app having only 2GB of RAM and a single processor might be a bit of a pain - I'm not sure. For reference, I have a 42 million row 10GB (about 5GB data, 5GB index) innodb database with only 1GB allocated to the buffer pool (4GB in the box) and that runs fine at maybe 1K queries/sec with moderate CPU usage. (dual proc xeon box) We have a moderately sized working set too, though it's not as random-access as yours could turn out to be.

                    With limited hardware resources you might be better off looking at alternative solutions if you are happy to give up some of the benefits of running on a full RDBMS - but I'm not sure what such alternatives might be. Personally I'd set up a prototype, run some benchmarks, and see where that leaves you. Having your data in MySQL vs a Perl hashtable is usually worth the overhead.

                    Good luck, and if you have time it'd be good to hear how you get on!

                    Toasty



                    • #11
                      Hi Toasty!

                      My apologies for the delayed response. Things have been hectic lately...

                      To tell you the truth, writing our own simplified database engine was considered too ) Moreover, the current project is targeted at enhancing an older implementation, which relied on such a custom database engine with hand-crafted B-trees, mmap()-ed data access, and even a rudimentary transaction log. The primary goal is to achieve comparable or better performance and to make the solution scale given a predefined and restricted hardware configuration for a single computational unit. That's why we turned to MySQL. Originally it seemed like overkill because of the overhead it adds, but after a few evaluations we decided that it may be our best bet.

                      Thanks a lot for mentioning the "HANDLER" command. Given how primitive my queries are, this command may provide exactly what I need and strip off all the overhead I'm not very willing to cope with. However, the downside is that I am very likely to lose compatibility with other database servers, because the "HANDLER" approach circumvents the abstraction provided by SQL as a standard data query language.

                      You are right on target again when you talk about the "hot" data set. It shows that you do deal with large data sets ) In my case, there is a "hot" data set, but at the moment it is hard to estimate its size. Besides, it may change based on the system inputs.

                      I will run some tests and I will be glad to share the findings. Thanks again for all of your valuable feedback!


                      Sincerely,
                      /Sergey

