Database problems in MySQL/PHP Applications

An article about database design problems is being discussed by Kristian.

Both the article itself and the response cause mixed feelings, so I decided it is worth commenting:

1. Using mysql_* functions directly. This is probably bad, but I do not like the solutions proposed by the original article either. PEAR is slow, as are other complex connectors. I have not yet tested PDO but would not expect it to beat MySQLi in speed. It is, however, a bad idea to use mysql_ functions directly as well – I would go for the mysqli object approach. The great thing about objects is that you can easily override methods to get debugging and profiling tools, as well as tools which protect you from SQL injection.

For example, I have a little wrapper which allows me to do $dbcon->query("SELECT email FROM user WHERE name=%s", $name) – the wrapper detects that query is being called with multiple parameters and performs the needed checks and query rewriting. You can also use a pretty much direct path to the mysqli extension for performance-critical queries if you need to.
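
As a rough illustration of the kind of query rewriting such a wrapper might do, here is a minimal sketch. It is not the actual wrapper: the function name build_query and its signature are hypothetical, and the escaping function is passed in so the sketch runs without a database. One known limitation of this naive approach is that a literal %s elsewhere in the query (for example LIKE '%smith') would be mistaken for a placeholder.

```php
<?php
// Hypothetical helper, not the author's actual wrapper.
// Replaces each %s in $template with a quoted, escaped parameter.
// $escape is any string-escaping callable; with a live connection it
// would typically be [$mysqli, 'real_escape_string'].
function build_query(string $template, array $params, callable $escape): string
{
    $params = array_values($params);
    $parts = explode('%s', $template);

    // One placeholder per parameter, otherwise refuse to build the query.
    if (count($parts) - 1 !== count($params)) {
        throw new InvalidArgumentException('Placeholder/parameter count mismatch');
    }

    $sql = $parts[0];
    foreach ($params as $i => $value) {
        if ($value === null) {
            $sql .= 'NULL';                 // SQL NULL, unquoted
        } elseif (is_int($value)) {
            $sql .= (string) $value;        // integers need no quoting
        } else {
            $sql .= "'" . $escape((string) $value) . "'";
        }
        $sql .= $parts[$i + 1];
    }
    return $sql;
}

// addslashes() is only a stand-in so the sketch runs without a database;
// it is NOT a safe substitute for real_escape_string in production.
echo build_query('SELECT email FROM user WHERE name=%s', ["O'Reilly"], 'addslashes'), "\n";
// SELECT email FROM user WHERE name='O\'Reilly'
```

Passing the escaper as a parameter keeps the rewriting logic separate from the connection, which also makes it easy to add logging or timing around it.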

I would also note that for many PHP applications the abstraction layer is not the main performance problem, and the benefit from persistent connections can be much more modest. DVD Store was a special type of application, designed to have very simple logic besides the database – in most cases you would have beautiful page rendering as well as many more queries per page, which makes the performance improvement much smaller. A notable exception is AJAX applications, which may do very little work and formatting, so the database connection may become the issue. Caching should be a good help in this case, though.

About consulting – it is worth mentioning that it was my group which did the Dell DVD Store optimization, and I’m now on my own, offering MySQL and LAMP Consulting Services.

2. Not using auto_increment functionality. This is right, with some exceptions however. For example, InnoDB takes an internal table-level lock (the AUTO-INC lock) on inserts when auto_increment is used, so using values generated elsewhere might be faster.

3. Using multiple databases. Honestly, I do not see applications using one database per table that often. I do, however, often see applications using multiple databases to group tables by certain logic, just as you use directories to group files. I think this makes a lot of sense. Sometimes the grouping is done in a way that needs a lot of databases – for example, if grouping is done by user. This might be a bit extreme if you have thousands of users – I would rather do a many-to-many relationship between users and tables, but it also might work.

As for the claim that if you use many tables you’re doing something wrong: it is frequently made by people with a traditional database background. Things are different with MySQL.

There are many successful applications using tens of thousands of tables per host and achieving great performance by doing so.
Using multiple tables gives some very important benefits – your data becomes manageable, and your ALTER TABLE or OPTIMIZE TABLE now locks a small table for a few seconds rather than a giant 100GB table for a few hours, so it can be done pretty much online. You also get good data clustering, so a table becomes hot very quickly due to data locality once its user starts running queries. It is also much easier to do backup and restore if you need only a portion of your data recovered.

There are some performance problems with many tables: some are OS and file system dependent, others relate to the InnoDB storage engine, or to using the innodb_file_per_table option in particular.

4. Not using relations. This one is right, but also with a catch. Normalizing your data is a very traditional recommendation; however, it does not always bring good performance. Joins are expensive and you can often do much better with denormalized data. You may wish to use the denormalized data as a cached lookup table, however, so you do not have all these problems with losing data etc. Read more in my Why MySQL Could be slow with Large Tables article.

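5. The n+1 pattern. This probably should rather be called "Not using JOIN". This is a typical error. On the other hand, in MySQL you might be better off using several queries than one complicated one. Of course, you would rather use IN() than run 100 queries in this case. This mostly applies to subqueries, where subselects with IN() are executed as correlated even if they are not, so it pays to use an IN() list of values derived from a previous query. For example you can do:

SELECT id FROM users WHERE featured=1;

Now populate the list for IN in your PHP application:

SELECT * FROM articles WHERE user_id IN(23,545,654,34)

instead of:

SELECT * FROM articles WHERE user_id IN (SELECT id FROM users WHERE featured=1)

Some day this should be fixed, but do not expect it soon.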
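
The step of turning ids fetched by a first query into an IN() list on the PHP side could be sketched as follows. The helper name in_list is hypothetical, and the sketch assumes the ids are integers (it forces them through intval so nothing unexpected ends up in the SQL).

```php
<?php
// Hypothetical helper: builds the comma-separated body of an IN (...) list
// from ids returned by an earlier query. Assumes integer ids.
function in_list(array $ids): string
{
    if ($ids === []) {
        // IN () is a syntax error; IN (NULL) is valid and matches no rows.
        return 'NULL';
    }
    return implode(',', array_map('intval', $ids));
}

// Ids as they might come back from: SELECT id FROM users WHERE featured=1
$ids = [23, 545, 654, 34];
$sql = 'SELECT * FROM articles WHERE user_id IN (' . in_list($ids) . ')';
echo $sql, "\n";
// SELECT * FROM articles WHERE user_id IN (23,545,654,34)
```

Handling the empty list explicitly matters in practice, because a first query that returns no rows would otherwise produce invalid SQL.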

Use Indexes. This item was not in the original article; however, I think this is the most common mistake and it is very important to fix it. Most applications I have to fix have a number of indexes missing, which forces queries to do full table scans. Funnily enough, this is often not a problem in the beginning – if the application is bought or custom ordered it frequently can pass customer QA – it will work quite fast with an almost empty database. With database growth, however, it will start to crawl.
So when developing your PHP applications, use a test database with a reasonable amount of data in it. And do run EXPLAIN for your queries, especially if you see them in the slow query log. If you have trouble understanding EXPLAIN or optimizing your queries, remember we’re here to help.
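
One simple thing to look for in EXPLAIN output is a join type of ALL, which means MySQL reads the whole table. A minimal sketch of checking for that from PHP is shown here; the function name full_scan_tables is hypothetical, and the sample rows are made up to have the same shape as rows fetched from an EXPLAIN statement as associative arrays.

```php
<?php
// Hypothetical helper: given EXPLAIN rows as associative arrays,
// return the names of tables that are read with a full table scan.
function full_scan_tables(array $explainRows): array
{
    $tables = [];
    foreach ($explainRows as $row) {
        // type = ALL in EXPLAIN output means every row of the table is examined.
        if (($row['type'] ?? null) === 'ALL') {
            $tables[] = $row['table'];
        }
    }
    return $tables;
}

// Made-up sample rows; with a live connection they would come from
// running "EXPLAIN SELECT ..." and fetching each row as an array.
$rows = [
    ['table' => 'users',    'type' => 'ref', 'key' => 'featured', 'rows' => 4],
    ['table' => 'articles', 'type' => 'ALL', 'key' => null,       'rows' => 500000],
];
echo implode(', ', full_scan_tables($rows)), "\n";
// articles
```

A full scan is not always wrong (on a tiny table it can be the cheapest plan), so treat a hit as a prompt to look closer rather than as proof of a missing index.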

Comments

  1. kb says

    I know – old post but I just felt I had to speak my mind on this one…

    > 1. Using mysql_* functions directly This is probably bad but I do not like
    > solutions proposed by original article either. PEAR is slow as well as
    > other complex connectors. I have not yet tested PDO but would not
    > expect it to beat MySQLi in speed. It is however bad idea to use
    > mysql_ functions directly as well – I would go for using mysqli object
    > approach. The great things about objects is you can easily overload
    > methods and get debugging and profiling tools, as well as have
    > tools which protect you from SQL Injections.

    I agree – regardless of what database is used, abstracting out core functionality usually means more portable code. You’re right that, when used properly, prepare can help increase security against SQL injection.

    > For example I have little wrapper which allows to do
    > $dbcon->query(“Select email from user where name=%s”,$name) –
    > wrapper will detect query is being called with multiple parameters
    > and will perform needed checks and query rewriting. You also can
    > use pretty much direct path to mysqli extension to performance
    > critical queries if you need.

    Nice but using prepare/execute is more consistent with what I’ve seen others do. Then again, having your own wrapper function to handle the translations isn’t a bad idea either. I’m not suggesting your way is wrong, but it’s not as common.

    > I would also note for many PHP applications abstraction layer is
    > not the main performance problem, also benefit from persistent
    > connections can be much more modest. DVD Store was special
    > type of application which was designed to have very simple logic
    > besides database – in most cases you would have beautiful page
    > rendering as well as much more queries per page which will make
    > performance improvement much smaller. Notable exception being
    > AJAX applications which may do very little work and formatting, so
    > database connection may become the issue. Caching should be
    > good help in this case though.

    95% of the time – yes – I agree. Learning to use query and external caching of result sets is generally a good idea and having pre-compiled summary data ready to go in the database also helps depending on the needs of the application.

    > 2. Not using auto_increment functionality This is right. With some
    > exception however. For example Innodb tables do internal full table
    > lock if auto_increment is used so using values generated elsewhere
    > might be faster.

    I disagree with you on this completely, because InnoDB uses a clustered index based on (in order) the primary key you specify (an auto_increment), a unique key, or one it creates for you. It’s been my experience that failing to use a non-contextual primary key (NCPK – an auto_increment column) in nearly (emphasize nearly) every InnoDB table can have very serious and negative performance implications. Remember that if you don’t create either a unique key or a column with an auto_increment primary key in a table using InnoDB, the database will create one for you that you can’t use or see. I’ve also seen others try to make InnoDB use a composite UNIQUE key on a table as a primary key, taking up far more space than even a BIGINT UNSIGNED will and gaining no benefit from it. I’ve seen developers ask the database to do row lookups by multiple varchar columns. Don’t do it. I can think of very, very few cases when I would even think about not using an NCPK, and most of those demand using some other storage engine like Archive…

    > 3. Using multiple databases Honestly I do not see application using
    > one database per table that often. I however often see applications
    > using multiple databases to group tables by certain logic, such as
    > you do with directories to group files. I think this makes a lot of
    > sense. Sometimes grouping is done so a lot of databases are
    > needed – for example if grouping is done by user. This might be a
    > bit extreme if you have thousands of users – I would rather do
    > many to many relationship between users and tables but it also
    > might work.

    Having a logical separation of data sets helps document relationships of data. There is no reason why a query can’t reference data in two separate databases in the same instance of MySQL. If nothing else, it makes replicating a subset of the data much easier.

    > Regarding if you use many tables you’re doing something wrong it
    > is frequently told by people with traditional database background.
    > Things are different with MySQL.

    Good design will help dictate how many tables are required in order to get the job done. To be good at designing an application, the developer will be wise to consider not only what needs to be stored, but how the data will be retrieved and how it will be used. What kind of summary data will be required? What kind of questions will the application ask of the database?

    > There are many successful applications, using tens of thousands
    > of tables per host and achieving great performance by doing so.
    > Using multiple tables gives some very important benefits – your data
    > becomes manageable, your ALTER TABLE or OPTIMIZE TABLE
    > now locks small table for few seconds rather than giant 100GB table
    > for few hours so can be done pretty much online. You also get good
    > data clustering so table becomes hot very quickly due to data locality
    > once this user starts his queries. It is also much easier to do backup
    > and restore if you need only portion of your data recovered.

    While this is true if the goal is to restore a part of the data set, when it’s time to do a full data restore, more tables means longer restore times. Deciding to create a new table for a part of the data set is not a black or white thing. It involves understanding how the data will be saved and used.

    > 4. Not using relations This one is right one but also with the catch. It
    > is very traditional recommendation to normalize your data however it
    > does not always bring good performance. Joins are expensive and
    > you can often do much better with denormalized data. You may wish
    > to use denormalized data as cached lookup table however so you do
    > not have all these problems with losing data etc. Read more in my
    > Why MySQL Could be slow with Large Tables article.

    Deciding whether or not to let the database maintain referential integrity is just one of the factors that developers should consider when deciding when / how to create indexes. Foreign key references may add processing time to inserts, updates, and deletes but there are other benefits. Also – if you’re replicating from a master to a slave, there is no reason why you can’t remove the foreign key references on the slave(s) as long as you are certain that all the changes are being made on the master and all those changes are making it to the slave(s).

    Like any other decision in database design, one must consider the cost of the benefit versus the value added by having the benefit available. Making a blanket statement that using relations is bad is unwise.

    > 5. The n+1 pattern This probably should rather be called Not using
    > Join. This is typical error. On other hand in MySQL you might be
    > better off using several queries than doing complicated ones. Of
    > course you would rather use IN() than do 100 of queries in this case.
    > This most applies to subqueries Where Subselects with IN() become
    > correlated even if they are not, and so using IN() list of values derived
    > by previous query. For example you can do:

    > SELECT id FROM users WHERE featured=1;

    > Now populate List for IN on your PHP application:

    > SELECT * FROM articles WHERE user_id IN(23,545,654,34)

    > instead of:

    > SELECT * FROM articles WHERE user_id IN (SELECT id FROM users WHERE featured=1)

    Ouch! Use INNER JOIN for this instead because it allows the optimizer to use the indexes to join rows together, potentially preventing a table scan. The trick is to know what kind of JOIN to use so that the right number of rows are scanned and returned.

    SELECT * FROM articles INNER JOIN users ON users.id = articles.user_id AND users.featured = 1 ;

    > Use Indexes This item was not in original article, however I think this
    > is the most common mistake and it is very important to fix it. Most
    > applications I have to fix have number of indexing missing which
    > requires queries to do full table scans. Funny enough this is often
    > not the problem in the beginning – if application is bought or custom
    > ordered it frequently can pass customer QA – it will work quite fast
    > with almost empty database. With database growth it will however start
    > to crawl.
    > So developing your PHP applications use test database with reasonable
    > amount of data in it. And do run EXPLAIN for your queries, especially
    > if you see them in slow query log. If you have trouble understanding
    > EXPLAIN or optimizing your queries remember
    > we’re here to help.

    Partial agreement: “Use Indexes smartly.” Indexing columns that have very low cardinality will often be a waste of resources because the optimizer just won’t use them. Cardinality can be thought of as “how unique or common is what I’m looking for in this table?” An index that has very high cardinality (as high as the number of rows in the table) is potentially very valuable if searches are done against that information. On the other hand, an index that has very low cardinality (half a million rows with an index on a Boolean column, for example) is unlikely to be helpful unless you know your data well enough to know you’re looking for rare occurrences of the “weird” setting of the column’s possibilities. Also, indexing the full width of very wide columns will take up tremendous amounts of memory and disk. Indexes get part of their speed from being stored as fixed-width data (think of converting your varchars to chars, for example); add byte comparisons across the full width every time and you can understand why I am concerned, especially for composite indexes.

  2. Vadim says

    What is considered a “reasonable amount of data”?
    I understand that it depends on the application and the number of tables/columns, but just as a rule of thumb, how many rows ensure that I am on the safe side?

  3. peter says

    Vadim,

    For a test dataset the rule is pretty simple – the system should behave the same as it will in production. So generate about the same amount of data as you would run on a single box in production. Sometimes you want to scale it down a bit if your test system is low end.

    Basically you want two things to hold. First, queries should be executed the same way on the production and test systems; that is, EXPLAIN output should be the same. Second, cache efficiency should be similar. A CPU-bound workload can’t be compared to a disk-bound one.

  4. says

    On other hand in MySQL you might be better off using several queries than doing complicated ones.

    This is true; however, most developers don’t realize that this is *within the database system only*, i.e., if you only go to the database once, then go ahead, run as many queries as you want.

    If a system is being developed with a *remote* database in mind, you actually have to find a good balance. If the network back and forth with 100 queries costs more than just doing the complex query in the first place, then it’s faster to do the complex query and only go across the network and back to the database once.

  5. peter says

    Sheery,

    I would not limit it to a single system. If you’re using several boxes you probably have them on a local network with a 1GBit connection between them. This allows you to do many thousands of queries per second from a single connection. From multiple connections it will be many tens of thousands per second.

    Network is fast these days; this is why memcached is getting so popular and why MySQL Cluster can exist.

    Now, you probably do not want to do 100 queries instead of one. Please take a close look at my recommendation: I recommend using 2-3 queries when the MySQL optimizer does not optimize a query efficiently enough, or when for some other reason a single query requires much more work than separate queries.

  6. suma says

    My website has 6 subdomains and separate databases. Is that the reason why it is slow? (I’m using PHP and MySQL with WordPress.)

  7. says

    Has anyone seen a situation where mysql reports max connections reached and freezes, with a bunch of processes waiting to finish?
    Also, mysql will not shut down unless a manual kill -9 is run.

    Our mysql db contains a mix of InnoDB and MyISAM tables and runs on Linux with 16 gigs of RAM.

    Here is our my.cnf

    # sample MySQL config file for very large systems.
    #
    #
    # This is for a large system with memory of 1G-2G where the system runs mainly
    # MySQL.
    #
    # You can copy this file to
    # /etc/my.cnf to set global options,
    # mysql-data-dir/my.cnf to set server-specific options (in this
    # installation this directory is /usr/local/mysql/data) or
    # ~/.my.cnf to set user-specific options.
    #
    # In this file, you can use all long options that a program supports.
    # If you want to know which options a program supports, run the program
    # with the "--help" option.

    # The following options will be passed to all MySQL clients
    [client]
    #password = your_password
    port = 3306
    socket = /tmp/mysql.sock
    #tmpdir =/mysql_tmp/
    # Here follows entries for some specific programs

    # The MySQL server
    [mysqld]
    port = 3306
    socket = /tmp/mysql.sock
    bind-address=10.234.94.71
    skip-locking
    key_buffer_size = 2000M
    max_allowed_packet = 32M

    # table_cache=20M
    # open-files-limit=20000

    table_cache = 3072
    open_files_limit = 9216

    tmp_table_size=1000M
    sort_buffer_size = 100M
    read_buffer_size = 100M
    read_rnd_buffer_size = 100M
    myisam_sort_buffer_size = 100M
    max_length_for_sort_data=2048
    max_sort_length=2048
    long-query-time=5
    log-slow-queries=/apps/log/slow-query
    interactive_timeout=300
    wait_timeout=300
    thread_cache = 40
    max_connections=500
    query_cache_size = 2000M
    # Try number of CPU’s*2 for thread_concurrency
    thread_concurrency = 8
    ft_min_word_len=3
    #skip-grant-tables

    # Don’t listen on a TCP/IP port at all. This can be a security enhancement,
    # if all processes that need to connect to mysqld run on the same host.
    # All interaction with mysqld must be made via Unix sockets or named pipes.
    # Note that using this option without enabling named pipes on Windows
    # (via the "enable-named-pipe" option) will render mysqld useless!
    #
    #skip-networking

    # Replication Master Server (default)
    # binary logging is required for replication
    log-bin=db1-bin
    log-bin-index=db1-bin.index

    binlog-ignore-db=chrome_vin
    binlog-ignore-db=dummyData
    #binlog-ignore-db=edmunds
    #binlog-ignore-db=evox
    #binlog-ignore-db=jato
    #binlog-ignore-db=kbb
    binlog-ignore-db=mysql
    binlog-ignore-db=test
    #binlog-ignore-db=us_incentives_extract
    #binlog-ignore-db=vehicles
    #binlog-ignore-db=voiceshot

    # required unique id between 1 and 2^32 – 1
    # defaults to 1 if master-host is not set
    # but will not function as a master if omitted
    server-id = 1

    # Replication Slave (comment out master section to use this)
    #
    # To configure this host as a replication slave, you can choose between
    # two methods :
    #
    # 1) Use the CHANGE MASTER TO command (fully described in our manual) –
    # the syntax is:
    #
    # CHANGE MASTER TO MASTER_HOST=<host>, MASTER_PORT=<port>,
    # MASTER_USER=<user>, MASTER_PASSWORD=<password> ;
    #
    # where you replace <host>, <user>, <password> by quoted strings and
    # <port> by the master's port number (3306 by default).
    #
    # Example:
    #
    # CHANGE MASTER TO MASTER_HOST='125.564.12.1', MASTER_PORT=3306,
    # MASTER_USER='joe', MASTER_PASSWORD='secret';
    #
    # OR
    #
    # 2) Set the variables below. However, in case you choose this method, then
    # start replication for the first time (even unsuccessfully, for example
    # if you mistyped the password in master-password and the slave fails to
    # connect), the slave will create a master.info file, and any later
    # change in this file to the variables’ values below will be ignored and
    # overridden by the content of the master.info file, unless you shutdown
    # the slave server, delete master.info and restart the slaver server.
    # For that reason, you may want to leave the lines below untouched
    # (commented) and instead use CHANGE MASTER TO (see above)
    #
    # required unique id between 2 and 2^32 – 1
    # (and different from the master)
    # defaults to 2 if master-host is set
    # but will not function as a slave if omitted
    #server-id = 2
    #
    # The replication master for this slave – required
    #master-host =
    #
    # The username the slave will use for authentication when connecting
    # to the master – required
    #master-user =
    #
    # The password the slave will authenticate with when connecting to
    # the master – required
    #master-password =
    #
    # The port the master is listening on.
    # optional – defaults to 3306
    #master-port =
    #
    # binary logging – not required for slaves, but recommended
    #log-bin
    # Point the following paths to different dedicated disks
    #tmpdir = /tmp/
    #log-update = /path-to-dedicated-directory/hostname
    tmpdir =/mysql_tmp/:/tmp/
    # Uncomment the following if you are using BDB tables
    #bdb_cache_size = 384M
    #bdb_max_lock = 100000

    # Uncomment the following if you are using InnoDB tables
    #innodb_data_home_dir = /usr/local/mysql/data/
    #innodb_data_file_path = ibdata1:2000M;ibdata2:10M:autoextend
    #innodb_log_group_home_dir = /usr/local/mysql/data/
    #innodb_log_arch_dir = /usr/local/mysql/data/
    innodb_data_home_dir = /db
    innodb_data_file_path = ibdata1:10M:autoextend
    innodb_log_group_home_dir = /db
    innodb_log_arch_dir = /db

    # You can set .._buffer_pool_size up to 50 – 80 %
    # of RAM but beware of setting memory usage too high
    innodb_buffer_pool_size = 8000M
    #innodb_additional_mem_pool_size = 80M
    # Set .._log_file_size to 25 % of buffer pool size
    innodb_log_file_size = 1000M
    #innodb_log_buffer_size = 32M
    #innodb_flush_log_at_trx_commit = 1
    #innodb_lock_wait_timeout = 50

    [mysqldump]
    quick
    max_allowed_packet = 16M

    [mysql]
    no-auto-rehash
    # Remove the next comment character if you are not familiar with SQL
    #safe-updates

    [isamchk]
    key_buffer = 256M
    sort_buffer_size = 256M
    read_buffer = 2M
    write_buffer = 2M

    [myisamchk]
    key_buffer = 256M
    sort_buffer_size = 256M
    read_buffer = 2M
    write_buffer = 2M

    #[mysqlhotcopy]
    interactive-timeout

  8. peter says

    Greg, can you show the processlist so we can understand exactly what you mean, and also specify the MySQL version and what kind of binary you’re using? It would be best if you reported the problem on the forum, as it is not really related to this blog post.

  9. says

    I have a question about using multiple databases and wanted some advice. I have a jewelry website I am working on and we are switching everything to be dynamic. Let’s say we are selling jewelry. The user clicks on Jewelry and from Jewelry they click on a brand, let’s call it Brand A. Under Brand A there are “Rings, Necklaces, Engagement Rings”. When a user clicks on “Rings” there is a page that displays sub-categories like “Wedding Band”, so each category like Rings, Necklaces, etc. has its own sub-categories of items. How would I go about structuring the database? Should I have 1 DB per vendor (i.e. Brand A, B, C, D)? I am stuck figuring this out because each vendor A, B, C, D etc. has its own MAIN CATEGORIES, and in those MAIN CATEGORIES you have sub-categories.

    Thanks so much for your help.
