Announcement

Announcement Module
Collapse
No announcement yet.

3 way LEFT JOIN with GROUP CONCAT, eliminate using filesort

Page Title Module
Move Remove Collapse
X
Conversation Detail Module
Collapse
  • Filter
  • Time
  • Show
Clear All
new posts

  • 3 way LEFT JOIN with GROUP CONCAT, eliminate using filesort

    Hi everyone

    I'm relatively new to MySQL and gradually learning new stuff, but I'm having a problem with a particular query. I've asked around in the PHP DevNet forums and it seems I should be getting much better performance on this query, as my tables are pretty small and the query's not particularly advanced.
    The structure is nicely normalised, something I've wanted to have for years. I'm gradually switching our company intranet subscribers database from filemaker to MySQL.
    My server is Ubuntu Server 8.04 LTS, MySQL5, PHP5 and Apache2. On a pretty recent Intel Core 2 server with 4GB memory.

    I'd really appreciate it if someone could give me some advice on this, as it's been driving me crazy!

    contacts has about 36,000 records
    companies: 38,000
    tags_data: 42,000
    tags: 980
    My total DB size is about 26MiB, so it's tiny in comparison to many MySQL databases. I think this query should not be taking 32 seconds to complete!

    Basically the tags table has various tags/classifications which can be applied to companies. So you can tag a company as MANUFACTURER, DISTRIBUTOR, SHOW EVENT ORGANISER... etc and there's 980 different tags/classifications to choose from. Each company can have multiple tags applied, and the tags_data table stores which tags are applied to which company.

    Here's a quick diagram of part of the database which should help to visualise

    green bold are the primary key columns. Red are the columns which link the data together.

    Here's the query I'm trying to run:

    SELECT `contacts`.`contact_name`, `companies`.`company_name`, GROUP_CONCAT(`tags`.`name` ORDER BY `tags`.`name` ASC SEPARATOR ',') AS `classifications`FROM `contacts`LEFT JOIN `companies` USING(`company_id`)LEFT JOIN `tags_data` ON `companies`.`company_id`=`tags_data`.`company_id`L EFT JOIN `tags` ON `tags_data`.`tag_id`=`tags`.`tag_id`WHERE `contacts`.`subscribed`='VIP'GROUP BY `contacts`.`contact_id`

    The syntax on that might be improved I guess, but I don't know how to do that. That's the limit of my MySQL knowledge really.

    I'm trying to get contact_name, company_name and a column called classifications which is a comma separated list of the tags/classifications names that are selected for that particular company.
    contacts.subscribed can have 3 different values: NULL, FREE or VIP

    This query only returns 777 records, but I'll be looking to return up to 20,000 with a variation of this query if I can get it running faster!
    Initially this query took about 80 seconds, but I added a composite index (only just learnt about them) on the tags_data table (index called tags_companies) which reduced it to about 30 seconds.
    I removed the ORDER BY or LIMIT from the query earlier as well, but it made very little difference.

    Here's the output of EXPLAIN on this query:

    id select_type table type possible_keys key key_len ref rows Extra1 SIMPLE contacts ref IX_subscribed IX_subscribed 15 const 1103 Using where; Using temporary; Using filesort1 SIMPLE companies eq_ref PRIMARY PRIMARY 4 dwcintranet.contacts.company_id 1 1 SIMPLE tags_data index company_id tags_companies 26 NULL 41930 Using index1 SIMPLE tags eq_ref PRIMARY PRIMARY 4 dwcintranet.tags_data.tag_id 1


    So I think the problem is the Using filesort. I've tried taking the query apart to eliminate the filesort, and it seems it's the 2nd join that's causing the filesort: tags_data ON companies
    But I don't know what else I could index to sort this out

    Any ideas/suggestions?
    I would really appreciate some help on this, I'm convinced that this query shouldn't be taking 30 seconds!

    Thanks, Ben

  • #2
    Here's the SHOW CREATE TABLE of the 4 tables

    contacts

    CREATE TABLE `contacts` ( `contact_id` int(11) NOT NULL auto_increment, `company_id` int(11) default NULL, `address1_contact` varchar(255) default NULL, `address2_contact` varchar(255) default NULL, `address3_contact` varchar(255) default NULL, `business_card` varchar(2) default NULL, `contact_id_old` int(20) default NULL, `contact_name` varchar(255) default NULL, `county_state_contact` varchar(100) default NULL, `create_name` varchar(60) default NULL, `create_stamp` int(10) default NULL, `email_contact` varchar(255) default NULL, `email_contact_optout` tinyint(1) default NULL, `fax_contact` varchar(255) default NULL, `geo_contact` varchar(255) default NULL, `job_title` varchar(255) default NULL, `subscribed` varchar(4) default NULL, `mail_contact` tinyint(1) default NULL, `modify_count` int(7) default '0', `modify_name` varchar(60) default NULL, `modify_stamp` int(10) default NULL, `note` text, `postcode_zip_contact` varchar(50) default NULL, `reg_stamp` int(10) default NULL, `tel_cell` varchar(255) default NULL, `tel_contact` varchar(255) default NULL, `tel_ext` varchar(255) default NULL, `tel_home` varchar(255) default NULL, `temp` varchar(50) default NULL, `town_city_contact` varchar(100) default NULL, `view_count` int(7) default '0', `view_name` varchar(60) default NULL, `view_stamp` int(10) default NULL, PRIMARY KEY (`contact_id`), KEY `company_id` (`company_id`), KEY `contact_name` (`contact_name`), KEY `IX_subscribed` (`subscribed`), KEY `companies_contacts` (`company_id`,`contact_id`), KEY `contacts_companies` (`contact_id`,`company_id`)) ENGINE=MyISAM AUTO_INCREMENT=37645 DEFAULT CHARSET=utf8



    companies

    CREATE TABLE `companies` ( `company_id` int(11) NOT NULL auto_increment, `ad_lead` tinyint(1) default NULL, `address_priv` tinyint(1) default NULL, `address1` varchar(255) default NULL, `address2` varchar(255) default NULL, `address3` varchar(255) default NULL, `advisory` varchar(255) default NULL, `advisory_name` varchar(60) default NULL, `advisory_stamp` int(10) default NULL, `agency_address` varchar(255) default NULL, `agency_company` varchar(255) default NULL, `agency_contact` varchar(255) default NULL, `agency_email` varchar(255) default NULL, `agency_fax` varchar(255) default NULL, `agency_id` int(11) default NULL, `agency_tel` varchar(255) default NULL, `budget_year1` varchar(3) default NULL, `budget_year2` varchar(3) default NULL, `company_name` varchar(255) default NULL, `county_state` varchar(100) default NULL, `create_name` varchar(60) default NULL, `create_stamp` int(10) default NULL, `currency` varchar(3) default NULL, `dir_contact` varchar(255) default NULL, `email` text, `email_optout` tinyint(1) default NULL, `email_priv` tinyint(1) default NULL, `fax` varchar(255) default NULL, `fax_priv` tinyint(1) default NULL, `geo` varchar(2) default NULL, `hanging_file` tinyint(1) default NULL, `iso_code` varchar(20) default NULL, `locked_name` varchar(60) default NULL, `locked_stamp` int(10) default NULL, `master_id` int(11) default NULL, `modify_count` int(7) default '0', `modify_name` varchar(60) default NULL, `modify_stamp` int(10) default NULL, `postcode_zip` varchar(50) default NULL, `rep` varchar(10) default NULL, `tel` varchar(255) default NULL, `tel_priv` tinyint(1) default NULL, `town_city` varchar(255) default NULL, `vat_number` varchar(20) default NULL, `view_count` int(7) default '0', `view_name` varchar(60) default NULL, `view_stamp` int(10) default NULL, `website` text, PRIMARY KEY (`company_id`), KEY `geo` (`geo`), KEY `company_name` (`company_name`)) ENGINE=MyISAM AUTO_INCREMENT=39200 DEFAULT CHARSET=utf8


    tags_data

    CREATE TABLE `tags_data` ( `tagdata_id` int(11) NOT NULL auto_increment, `tag_id` int(11) default NULL, `company_id` varchar(6) default NULL, `create_name` varchar(60) default NULL, `create_stamp` int(10) default NULL, PRIMARY KEY (`tagdata_id`), KEY `tag_id` (`tag_id`), KEY `company_id` (`company_id`), KEY `tags_companies` (`tag_id`,`company_id`)) ENGINE=MyISAM AUTO_INCREMENT=42205 DEFAULT CHARSET=utf8


    tags

    CREATE TABLE `tags` ( `tag_id` int(7) NOT NULL default '0', `category` varchar(255) default NULL, `description` varchar(255) default NULL, `name` varchar(255) default NULL, PRIMARY KEY (`tag_id`), KEY `category` (`category`), KEY `name` (`name`)) ENGINE=MyISAM DEFAULT CHARSET=utf8


    I'd really appreciate it if someone could take a look through to see how I can speed this up!

    Thanks so much, Ben

    Comment


    • #3
      I think your problem is here:


      1 SIMPLE tags_data index company_id tags_companies 26 NULL 41930 Using index


      The previous line should produce a set of company ids used to join into tags_data. Unfortunately, MySQL is doing an index scan for every row produced from the previous step. Basically, MySQL is evaluating 41930 * 1103 rows to produce your result set. That's over 40M, which explains why things are taking so long.

      I think you can clear this up by eliminating the left join from tags_data into tags. Since this is a join table, you really shouldn't have orphaned rows. If you do, they should be easy to clean up.

      You may also want another covering index on tags_data, on (company_id, tag_id). This will be used in joins from companies to tags. Your existing index only helps when joining from tags to companies. Since you're using MyISAM, and only indexes are cached, having (company_id, tag_id) instead of just company_id should help things move faster as well.

      Comment


      • #4
        Thanks for the reply.
        I figured that's what was happening behind the scenes.

        I think I need the LEFT JOIN from tags_data into tags, because I want to access the tags.name field in my GROUP CONCAT()

        Do I still need that particular LEFT JOIN to accomplish that?

        Cheers, B

        Comment


        • #5
          No, an inner join there should be fine. Basically, with the left join from tags_data to tags, you're making sure rows from tags_data without matching rows in tags are included. I can't think of any reason why a row from tags_data has a tag_id which is not present in tags.

          Right now, here's the degenerate rows that can be returned by your query:

          A valid contact with no company and no tags
          A valid contact with a valid company and no tags
          A valid contact with a valid company, rows in tags_data but no tags

          If you're certain this is what you want, then your left joins are correct. Otherwise, convert the left joins to inner joins and you'll save yourself some processing.

          Comment


          • #6
            Thanks for the reply.

            I just changed that tags on tags_data.tag_id=tags.tag_id into a regular JOIN, rather than LEFT JOIN
            And it ran much quicker, but it returned fewer results.

            I do want to return companies even if they have no tags selected in tags_data. So I guess that has to stay a LEFT JOIN.

            I added that extra covering index between tags_data and companies.
            So now I have 2 on there: company_id, tag_id and tag_id, company_id

            But the query takes about the same time.

            Here's an updated EXPLAIN output after adding that 2nd index, but all still as LEFT JOINS as before:

            id select_type table type possible_keys key key_len ref rows Extra1 SIMPLE contacts ref IX_subscribed IX_subscribed 15 const 1103 Using where; Using temporary; Using filesort1 SIMPLE companies ref PRIMARY,company_id_geo company_id_geo 4 dwcintranet.contacts.company_id 1 1 SIMPLE tags_data index company_id,companies_tags tags_companies 26 NULL 41930 Using index1 SIMPLE tags eq_ref PRIMARY,tag_id_name PRIMARY 4 dwcintranet.tags_data.tag_id 1


            Any idea on what else I can try indexing or changing?
            Would it be quicker to use subquery/subselect instead of 3 joins? I've never tried using those before so not sure if that's possible.

            Thanks for all your help so far!
            Cheers, B

            Comment


            • #7
              I'd be curious to see the results of this:


              EXPLAIN SELECT companies.company_name, tags_data.*FROM companies LEFT JOIN tags_data USING(company_id)


              If the plan for that query chooses the (tag_id, company_id) index, then I'd try running ANALYZE TABLE to make sure the index statistics are up to date.

              Comment


              • #8
                I ran that EXPLAIN and here's the output:


                id select_type table type possible_keys key key_len ref rows Extra1 SIMPLE companies ALL NULL NULL NULL NULL 36557 1 SIMPLE tags_data ALL company_id,companies_tags NULL NULL NULL 41930


                So it looks like it's using the company_id,tag_id index (companies_tags).

                Comment


                • #9
                  No index is being used according to that plan, which means that given how much data you have, MySQL thinks a table scan is faster than using the index. It could have used one of those indexes, but it decided not to.


                  See what the explain looks like for this:


                  SELECT tags_data.company_id, GROUP_CONCAT(`tags`.`name` ORDER BY `tags`.`name` ASC SEPARATOR ',') AS `classifications` FROM tags_data INNER JOIN tags USING(tag_id) GROUP BY tags_data.company_id



                  If that's not terrible (the product of the rows columns is not huge), you can try using it to eliminate the last left join.


                  SELECT `contacts`.`contact_name`, `companies`.`company_name`, company_tags.classificationsFROM `contacts`LEFT JOIN `companies` USING(`company_id`)LEFT JOIN ( SELECT tags_data.company_id, GROUP_CONCAT(`tags`.`name` ORDER BY `tags`.`name` ASC SEPARATOR ',') AS `classifications` FROM tags_data INNER JOIN tags USING(tag_id) GROUP BY tags_data.company_id ) AS company_tagsON companies.company_id = company_tags.company_idWHERE `contacts`.`subscribed`='VIP'


                  However, I'm not sure this is a good idea, since that inner select doesn't get any benefit from the restriction applied to contacts and therefor companies. It will always load all your tags, which might be fine if that's the common case.

                  I also didn't try either of these myself, so you might have to clean up any syntactic problems I have.

                  The other thing you should consider since you're implementing tags and you're already using MyISAM, is using fulltext search. You'd end up just adding a text column to companies that contains a comma separated list of tags and creating a fulltext index on that column.

                  The query you're trying to write will then stop after the left join into companies, and your classifications will already be available to you without any grouping. As an added bonus, searching by tags, especially a combination of tags, can be done through the fulltext index instead of some awful looking query which joins the tags table over and over.

                  You'll need to change your code to maintain this column, which is probably best done outside of SQL in whatever language you're already using for your application.

                  Comment


                  • #10
                    Hi vgatto

                    I tried out your suggestion, and it returned the desired records quickly. However it didn't seem to scale that well for larger datasets, but I'll be keeping that in the bank.

                    What's the terminology for that type of query?
                    Is that a sub-select?

                    I did actually solve my problem with the 3 LEFT JOINs.
                    On comp.databases.mysql someone noticed that I'd made an error in my table structure.

                    companies.company_id was INT(11)
                    tags_data.company_id was VARCHAR(6)

                    A while ago I did have company_id as VARCHAR(6) for compatibility with our filemaker solution, but I changed and forgot to change it in tags_data
                    Making that one change made the original query run at 0.12 seconds, rather than 32 seconds!!!

                    Thanks so much, B

                    Comment

                    Working...
                    X