Announcement

Announcement Module
Collapse
No announcement yet.

table to table matching

Page Title Module
Move Remove Collapse
X
Conversation Detail Module
Collapse
  • Filter
  • Time
  • Show
Clear All
new posts

  • table to table matching

    Hi, I didn't really find similar cases:

    I have 2 tables (in 2 different databases, if that makes a difference).

    These tables have an equal number of fields, named the same, eg:

    db1.table1:
    id, name, info1, info2

    db2.table1:
    id, name, info1, info2

    I need to match every field of every column with the other's table column.
    I need to find out which field is equal to which and if all fields of one row are NOT equal to *any* another row, then i display it. id is the primary key and auto incremented. Note that, I only want to know the difference against table1 from table2, so LEFT joining is ok (instead of some OUTER join)


    So, i'm doing that:

    SELECT db1.table1.id as cid FROM db1.table1 LEFT JOIN db2.table1 ON (db1.table1.name = db2.table1.name) AND (db1.table1.info1 = db2.table1.info1) AND (db1.table1.info2 = db2.table1.info2) WHERE db2.table1.id IS NULL

    As you can see its a bit complex already.
    Over very large tables however this query takes *forever*.
    Of course, by nature, I have to compare each field's row to *every single other field's row* to determine is this very row already exist in the other table, so its quite resource consuming.

    Anyway, the performance of this is terrible, any idea how to make it faster? (ps entries are inserted once, then only selected so indexs are ok)

    I made a test with 150 000 rows, it took so long i had to cancel it. It's pretty fast over 30-50 rows

  • #2
    Why not to use mk-table-checksum (from http://www.maatkit.org/)? )

    Comment


    • #3
      while this tool is very fast, it doesn't do what i need.
      however, i might write my own tool, if i do not find a way to make it fast with mysql queries.

      this tool will checksum the whole table or whole table column then you compare checksums.

      What i do instead (and this is what my query does currently), is to check every individual field.

      Eg:

      table 1:
      id name info
      1 bob enginer
      2 jim ceo
      3 albert janitor
      4 sonia reception

      table 2:
      id name info
      1 bob technic
      2 jim ceo
      3 john janitor
      4 sonia reception

      the tables have 2 differences, at id 2 and 3, my query will output:
      2, 3

      In more details, the values are actually hashs that i compare, and some are so-called fuzzy-hash, that is, hashs that represent similar data, e.g. john and jhon (typo) would match. This is done through a mysql udf (user defined function, aka a library i programmed)


      To my understanding the LEFT JOIN is actually internally joining a full table to every field it will check, like:
      check entry 'bob' against the whole table 2
      check entry 'enginer' against the whole table 2
      and everytime load table 2 in memory.
      and so on. of course, this is overkill and very slow. (or maybe im wrong and it doesnt do that)

      I would only need to load the table 2 once in memory and check against it. Or, when memory is not sufficient (eg table is super large), split it.
      Or use any kind of query that would do that efficiently (which is what i'm looking for, i believe)

      Comment


      • #4
        bilboa wrote on Tue, 02 September 2008 06:40

        while this tool is very fast, it doesn't do what i need.



        Well, try one of these database tools, I think the compare tool will be the best for you.

        Good luck, let me know hows it going.

        Comment

        Working...
        X