AaronCrane.co.uk

Numbers and strings in MySQL

Here’s a MySQL query issued by some Perl code:

$db->do(q{
    CREATE TEMPORARY TABLE byline_age (
        byline_id int PRIMARY KEY,
        last_used int not null
    )
    SELECT byline_id, MAX(IfNull(published_at, ?)) AS last_used
    FROM article
    GROUP BY byline_id
}, undef, time());

Spot the bug? No, nor did I.

The bug is in the use of an untyped placeholder (the ? in the query string). When Perl’s DBD::mysql fills in that placeholder, it quotes the value passed in. However, the use of quotes forces the IfNull() function to return a string (not a number), which causes the MAX() aggregation to use string comparison rather than numeric comparison, which causes numbers in the 900-million range to be considered greater than numbers in the 1-billion range. That is, the quotes cause this MAX() to produce what is almost exactly the wrong value.

The underlying bug here is the bug that PHP and MySQL share when compared to Perl or Python or Ruby or C++ or Java. In Perl, numeric and string values are essentially interchangeable, but you have to say what you mean when comparing things. In Python and most other languages, you can’t do a mixed-type comparison of strings and numbers; you have to explicitly convert one side to a type compatible with the other, and then the single comparison operation can do the only possible thing. Of those, I prefer the Perl approach, but either way permits you to say exactly what you mean, and get the expected result.

MySQL, on the other hand, adopts neither approach. There’s only one comparison operator, but it accepts mixed-type operands, and just guesses what sort of comparison you wanted. That means it necessarily guesses wrong in some cases — and in this case in particular. If MySQL had refused to execute my query, I’d probably have emitted an annoyed grunt, then changed my query to ensure the placeholder value was explicitly converted to an integer. But because MySQL just silently ran my query in such a way as to do the wrong thing, I ended up with broken data, and I was lucky to spot it.

Here’s a simple rule: require the user either to specify the nature of the operation, or to ensure that the inputs are of compatible types. Anything else is a source of silent bugs. Users who aren’t competent to say what they mean are not competent to be writing software; pandering to them can only lead to making it harder for competent users to write correct software.