home :: computers :: databases :: mysql :: replace considered harmful

MySQL's REPLACE Considered Harmful

So we’ve set up a client with an online poll application using MySQL. Polls are created in Bricolage, and when they’re published, rather than writing data to files, the template writes data to the MySQL database. PHP code on the front-end server then uses the database records to manage the polls.

On the recommendation of one of my colleagues, I was using the MySQL REPLACE statement to insert and update poll answers in the database. At first, this seemed like a cool idea. All I had to do was create a unique index on the story_id and ord (for answer order) columns and I was set. Any time someone reordered the answers or changed their wording in Bricolage, the REPLACE statement would change the appropriate records and just do the right thing.

Or so I thought.

Come the day after the launch of the new site, I get a complaint from the customer that the percentage spread between the answers doesn’t add up to 100%. After some investigation, I realized that the poll_results table is using the ID of each question to identify the votes submitted by readers. This makes sense, of course, and is excellent relational practice, but I have overlooked the fact that REPLACE essentially replaces rows every time it is used. This means that even when a poll answer hasn’t changed, it gets a new ID. Yes, that’s right, its primary key value was changing. Yow!

Now we might have caught this earlier, but the database was developed on MySQL 3.23.58 and, as is conventional among MySQL developers, there were no foreign key constraints. So the poll results were still happily pointing to non-existent records. So a poll might appear to have 800 votes, but the percentages might be counted for only 50 votes. Hence the problem with the percentages not adding up to 100% (nowhere near it, in fact).

Fortunately, the production application is on a MySQL 4.1 server, so I made a number of changes to correct this issue:

  • Added foreign key constraints
  • Exploited a little-known (mis)feature of Bricolage to store primary keys for all poll answers (and questions, for that matter)
  • Switched from REPLACE to INSERT, UPDATE, and DELETE statements using the primary keys

I also started using transactions when making all these updates when a poll is published so that changes are always atomic. Now it works beautifully.

But the lesson learned is that REPLACE is a harmful construct. Yes, it was my responsibility to recognize that it would create new rows and therefore new primary keys. But any construct that changes primary keys should be stricken from any database developer’s toolbox. The fact that MySQL convention omits the use of foreign key constraints makes this a particularly serious issue that can appear to have mysterious consequences.

So my advice to you, gentle reader, is don’t use it.

Comments & Trackbacks

Greg Fortune wrote:

Yes, it was my responsibility to recognize that it would create new rows and therefore new primary keys.

Yup :) You might consider using the ON DUPLICATE KEY UPDATE syntax for insert as it will do exactly what you wanted. The docs are here. Note that the user comments on the REPLACE page mention ON DUPLICATE KEY several times.

There are actually cases in which REPLACE is useful, but clearly not in the case you describe.

Theory wrote:

Greg,

Interesting. I wasn't aware of the ON DUPLICATE KEY UPDATE syntax for INSERT statements. Interesting. Would it respect the primary key, since I don't specify it in an insert statement (I let the AUTOINCREMENT do its thing)?

—Theory

James Day wrote:

The (current) MySQL 4.1 and 5.n manuals seem pretty clear, with this as the first paragraph of the description:

REPLACE works exactly like INSERT, except that if an old record in the table has the same value as a new record for a PRIMARY KEY or a UNIQUE index, the old record is deleted before the new record is inserted.

Do you have any suggestions for improving how REPLACE is documented? What would have caused you to do a double-take if you'd read it before coding?

INSERT .. ON DUPLICATE KEY UPDATE seems unlikely to work unless you have a second unique key which contains something to match the existing record. Absent that, your choice to use an autoincrement field as the primary key is going to bite you again.

I do wonder why you're using an autoincrement field as the primary key in this situation. It doesn't appear to match the logic I'd expect from the description of the application, since the poll IDs (if they exist) and question IDs appear to be fixed and used in the table containing the poll description and answer options. Sounds as though those and some identifier for the end user giving the answer would be a more suitable primary key choice for the results table.

Theory wrote:

James

Do you have any suggestions for improving how REPLACE is documented? What would have caused you to do a double-take if you'd read it before coding?

I'd add something like this after the sentences you quoted:

Warning! If the the match between the new and old records is based on a UNIQUE constraint and you also have an AUTOINCREMENT. primary key, because REPLACE deletes and inserts a new record, the new record will have a new primary key!. If you don't want the value of your AUTOINCREMENT primary key to potentially break foreign key relationships, then use foreign key constraints with ON DELETE RESTRICT to prevent problems.

IOW, I would have done a double-take if it had said, This may change the value of your primary key.

INSERT .. ON DUPLICATE KEY UPDATE seems unlikely to work unless you have a second unique key which contains something to match the existing record.

I did. Perhaps I should have just made it the primary key. But I like my primary keys to be surrogate keys.

I do wonder why you're using an autoincrement field as the primary key in this situation. It doesn't appear to match the logic I'd expect from the description of the application, since the poll IDs (if they exist) and question IDs appear to be fixed and used in the table containing the poll description and answer options. Sounds as though those and some identifier for the end user giving the answer would be a more suitable primary key choice for the results table.

That's what I ended up doing. I just had to find a place to store the primary keys in the application that was updating and inserting these records. This is the correct way to do it: If you have an ID, do an UPDATE; otherwise, do an INSERT.

—Theory

Lukas wrote:

Well REPLACE requires that you actually set your primary key explicitly inside the query. I think this is rather obvious from the current docs. However you will need the same information to do an UPDATE so I dont really understand the issue at hand.

Theory wrote:

Lukas

My REPLACE query was not using or referring to the primary key at all. It was keying off of a different column with a unique index. I was letting the AUTOINCREMENT set the primary key value, and that's where the problem came from.

—Theory

Powered by KinoSearch