| Subcribe via RSS

The Dangers of Having status fields

August 27th, 2008 | 1 Comment | Posted in MySQL, MySQL Performance

Having a status column is very common in databases today. It can be used to denote a user status:

CREATE TABLE IF NOT EXISTS `user` (
`user_id` int(10) unsigned NOT NULL auto_increment,
`email` varchar(32) NOT NULL,
`pw_hash` char(40) NOT NULL COLLATE latin1_general_cs,
`status` enum('PENDING', 'ACTIVE', 'DISABLED') default 'PENDING',
`date_created` timestamp NOT NULL default CURRENT_TIMESTAMP,
PRIMARY KEY (`user_id`),
UNIQUE KEY `idx_email` (`email`)
);

or user-uploaded media status:

CREATE TABLE IF NOT EXISTS `media` (
`media_id` int unsigned NOT NULL auto_increment,
`owner_user_id` int(10) unsigned NOT NULL,
`title` varchar(32) NOT NULL,
`path` varchar(255) NOT NULL,
`description` varchar(128) NOT NULL,
`status` enum('PROCESSING', 'ACTIVE', 'FAILED', 'DISABLED') default 'PROCESSING',
`date_created` timestamp NOT NULL default CURRENT_TIMESTAMP,
PRIMARY KEY (`media_id`),
KEY `idx_owner` (`owner_user_id`)
);

Let’s say that we needed to retrieve a list of media associated with users. This can be easily accomplished with the following query:

SELECT SQL_NO_CACHE
`u`.`user_id`,
`m`.`media_id`
FROM `user` AS `u`
JOIN `media` AS `m` ON (`u`.`user_id` = `m`.`owner_user_id`);

If there are status columns involved, the query then becomes:

SELECT SQL_NO_CACHE
`u`.`user_id`,
`m`.`media_id`
FROM `user` AS `u`
JOIN `media` AS `m` ON (`u`.`user_id` = `m`.`owner_user_id`)
WHERE `u`.`status` = 'ACTIVE'
AND `m`.`status` = 'ACTIVE';

If non-active records are removed, the status columns dropped, and the first query run, there is a 15% increase in qps over the second query on the original tables in my test environment. The difference can becomes even more pronounced as the number of tables referenced in a JOIN increases!

If you must have status fields, choose the right data type. But if they can be avoided by archiving disabled records to “cold” databases and keeping “pending” records separate (say, until a user completes email verification), that is generally a better solution.

Tags: ,

Must we always escape values?

August 18th, 2008 | 6 Comments | Posted in MySQL, MySQL Performance, PHP

One of the cardinal rules of writing web applications is to escape user-generated input with functions like PHP’s real_escape_string. This is a great rule, but one that can have a negative impact on your application’s performance if used unnecessarily. For instance, when querying data with an integer parameter that is passed internally (not user-generated):

$query = "SELECT SQL_NO_CACHE * FROM `user` WHERE `user_id` = '" .
$mysqli->real_escape_string ( self::$user_id ) . "'";
$res = $mysqli->query ( $query );

The above code takes an average of 0.000922918319702 seconds to execute.

Whereas:

$query = "SELECT SQL_NO_CACHE * FROM `user` WHERE `user_id` = " . self::$user_id;
$res = $mysqli->query ( $query );

takes an average of only 0.000418901443481 seconds to execute.

Although the improvement is small (~0.0005 seconds), when your site runs millions (or tens-of-millions) of queries per day, the benefits begin to add up.

Tags:

The Query Performance Improvement Process

August 3rd, 2008 | No Comments | Posted in MySQL, MySQL Performance

The purpose of this post is to outline a general flow-chart for improving the performance of queryies in MySQL. Much has been written on using EXPLAIN to optimize queries, but there is a whole process that should be followed in order to maximize the effectiveness of query performance tuning. Following is a visual flow-chart of the process:

The Query Performance Improvement Process

Assuming that the problem query has been identified, the first question to be asked is:

1) Can the query be gotten rid of?

Surprisingly, the answer is often “yes”. As a result of the Rapid Application Development paradigm followed by many “Web 2.0″ companies, class interfaces are constantly changing and a data-set that used to be required, could no longer be. Equally as often, the query could be rolled up into another query with minimal effect on the other query’s performance. If the query can’t be gotten rid of or the data retrieved elsewhere,

2) Can the query be changed to select a smaller dataset with a more selective WHERE clause or eliminating columns that are returned but not used from the SELECT clause?

Some developers are trained to use SELECT * (link goes to an explanation of why “SELECT *” is evil) in all of their queries under the guise of “flexibility”, using perhaps only one or two columns. This can bog down the network if the unneeded columns are large, and eliminate the possibility of using covering indexes that could eliminate the need to read from disk when retrieving the result set. Additionally, a result can be made smaller by using a LIMIT clause if only a known few of the records will be used by the application. If neither of these approaches work,

3) Can the query be re-written?

If the query has a SUBQUERY, it might have more desireable performance characteristics if it were to be rewritten as a JOIN (tutorial here as well). Although the query re-writing process is outside the scope of this post, there are many good tutorials available elsewhere online.

4) Can indexes be added or changed?

Most people jump the gun and add/change indexes first, before trying to re-write the query. I do not suggest this because there are downsides to having too many indexes (slower INSERTs, poor use of memory, etc…). Some good rules-of-thumb for indexes are:

  • Index columns that are used in JOINs (typically Foreign Keys)
  • Use UNIQUE indexes when possible
  • Find places where you can use covering indexes
  • Ensure adequate selectivity on indexed columns
  • Use small datatypes (int vs. varchar)
  • Keep track of your data (selectivity can change over time)

5) Can the schema or code be changed?

Sometimes fixing a query is as simple as ensuring that joined columns have the same datatype or creating summary tables so that calculations are not performed each time the query is executed. More often, however, it can require that entire sections of code are re-written and data structures changed. This step is last in the list because it is usually the most time-consuming and error-prone.

Cache The Result

There are those who feel that caching a result (using memcached, for example) is a solution for badly-performing queries. I am not one of those people. If your cache servers die, for instance, wouldn’t you rather have the result be that the application performs more slowly, rather than becoming completely unresponsive? Only after your queries have achieved an acceptable level of performance should you explore the option of caching the results.

If you have gone through the entire process and been unable to fix the query, I would encourage you to repeat the process with a fresh set of eyes.

Tags: ,

Using Nagios as a MySQL Performance Profiler

June 17th, 2008 | 2 Comments | Posted in MySQL, MySQL Performance

Everybody knows than Nagios can be used as a service monitor to monitor things like Load Averages, MySQL Replication Status, RAID Array States, etc… Fewer know that there are plug-ins to monitor MySQL Performance Status, such as check_mysql_perf. Fewer still utilize Nagios’ built-in triggering mechanism to execute an additional script on the event of a critical alert.

It is not uncommon to experience a load spike in the middle of the night, only to discover that it immediately cleared itself. Many times, retroactive log analysis will not reveal anything out of the ordinary. In order to get an immediate snapshot of what is going on at the time of a Nagios alert, simply call a script via the event handler similar to this one:

#!/usr/local/bin/bash

Usage() {
echo 1>&2 "Usage: $0 -h <hostname> -u <username> -p <password> -s <state> -t <state_type>"
exit 1
}

MYSQL="/usr/local/bin/mysql"
MAIL="/usr/bin/mail"
RECIPS="dba@domain.com"
PERL="/usr/bin/perl"
[ $# -ne 10 ] && Usage

while getopts "h:u:p:s:t:" option; do
case $option in
h ) HOSTNAME="$OPTARG";;
u ) USERNAME="$OPTARG";;
p ) PASSWORD="$OPTARG";;
s ) STATE="$OPTARG";;
t ) STATE_TYPE="$OPTARG";;
* ) Usage;; # DEFAULT
esac
done

[ $STATE = "CRITICAL" -a $STATE_TYPE = "HARD" ] || exit 0
$MYSQL -h$HOSTNAME -u$USERNAME -p$PASSWORD --connect_timeout=10 -ss -e'SHOW FULL PROCESSLIST' | $PERL -pe 's/\\n/\n/g' | $MAIL -s"$HOSTNAME post-alert processlist" $RECIPS
exit 0

This will email dba@domain.com with the output of SHOW FULL PROCESSLIST;

This type of profiling is especially useful because it can easily be turned on and off without affecting your production site (and it is very inexpensive to execute). Don’t forget to grant your Nagios user the PROCESS privilege!

Tags: , ,