mysql replication


Brad’s mysqlcon presentation has me thinking through some of the ways we (mis)use MySQL
replication at Mitel. Most of it comes down to working on elegance: in a lot of places I think I’m doing things that work as intended, but are pretty far away from the Right Way. So now I’m trying a whole bunch of things to build confidence in the Right Way of doing things.


In doing so I have come to the following conclusions:

  • For databases of non-trivial size and indices, mysqldump
    is not useful, because reloading the dump and rebuilding the
    indices takes ridiculously long compared to any solution that
    involves copying around the raw data (replication, mysqlhotcopy,
    stop/cp/start).

  • For replication over a WAN rather than a LAN, LOAD DATA FROM MASTER
    is not useful, because it takes so long and so much bandwidth
    compared to locking, taking a local snapshot, unlocking, rsyncing
    the snapshot to the slave, and resetting the master and slave
    binlogs.
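
Concretely, that lock/snapshot/rsync path comes out to something like the following, with placeholder hostnames, paths, and database name; note that FLUSH TABLES WITH READ LOCK has to be issued in a mysql session that stays open until the UNLOCK TABLES, because the lock is released the moment that connection closes.

    slave$        /etc/init.d/mysql stop    # or however mysqld is stopped/started on your boxes
    master mysql> FLUSH TABLES WITH READ LOCK;
    master$       cp -a /var/lib/mysql/mydb /var/snapshot/mydb     # local copy, so the lock is brief
    master mysql> UNLOCK TABLES;
    master mysql> RESET MASTER;
    master$       rsync -az /var/snapshot/mydb/ slave:/var/lib/mysql/mydb/   # compressed for the WAN hop
    slave$        mysqld_safe --skip-slave-start &    # don't let it resume from the stale position
    slave mysql>  RESET SLAVE;
    slave mysql>  SLAVE START;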

Thoughts? For the record, we’re still all MyISAM, with our largest
tables about 1.5GB (and the whole database about 5GB) and sites
connected via a ~2Mbit/s VPN over the Internet.


9 responses to “mysql replication”

  1. I thought hotcopy was pretty locky and sqldump-ish.

    Just replicate to a dummy machine for backups, and also use that machine to seed fresh slaves. Say you have a new machine on the WAN that needs to be brought into the replication cluster: rsync the dummy’s data to it, along with master.info, and start it up.
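
    A rough sketch of that, with placeholder hostnames and paths, and assuming the new box already has its own server-id (and the rest of the usual slave settings) in my.cnf:

      dummy$    mysql -e "SLAVE STOP"
      dummy$    /etc/init.d/mysql stop     # or however mysqld is stopped on your boxes
      dummy$    rsync -az /var/lib/mysql/ newslave:/var/lib/mysql/   # datadir, master.info and all
      dummy$    /etc/init.d/mysql start    # the dummy catches back up from where it left off
      newslave$ /etc/init.d/mysql start    # the new slave picks up from the copied master.info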

    I’ve never used LOAD DATA FROM MASTER. They keep talking about making it not suck, but I haven’t heard any progress on it in a while.

  2. Amen to that, brother. Lock the tables. Copy the data. Unlock the tables. Move the data. Start the slave. Repeat. There is no other good way, I think.

    I just spent this past weekend building and rebuilding slaves, checking on replication status, tweaking indices, etc. I’ve come to these conclusions:

    1. Machine loads in the 300s affect replication pretty seriously
    2. Maybe we should STOP THAT
    3. We finally stopped that.
    4. Replication on MySQL is really pretty good, for the price
    5. …especially considering the (free) competition, which really isn’t in this case.

    We’re on MyISAM as well; InnoDB looks tasty but the headaches involved in replication with InnoDB (at least for free) are not as nice. Your setup sounds eerily similar to ours, even in terms of the database size I was dealing with.

    Hey, while we’re on the topic, do you find MySQL replication randomly breaking on you? How do you monitor your slaves? Oh, and what version are you running?

  3. I always saw LOAD DATA FROM MASTER as really intended for very small tables. I have always wondered, though: if your tables are that small, what’s replication doing on your menu?

    I like the dummy/backup machine idea. Pity we don’t have machines for that, but it’s a fantastic idea….

  4. It’s struck me as something that had to be there for some general sense of completeness (“What, you can only sync up a slave out-of-band? We need to fix that, practicality be damned.”).

    I think the mysqldump problem is a different cause to the same end. Yes, there needs to be a tool that can go from a database to a bunch of SQL that recreates it, but that doesn’t mean you actually want to rebuild a database of any real size that way.

    On the Big Database side, one common practice seems to be backing up snapshots infrequently and redo logs regularly, which is a practice you don’t hear about much on the OSS database side of things. (I doubt many think about MySQL binlogs as “redo logs”.) If I had tape onsite I’d think about that, but tape is a few online backups away, which makes anything that amounts to “just keep writing this to tape indefinitely” a bit inconvenient.
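
    In MySQL terms that routine would look something like the following, with the binlog base name, paths, and “backuphost” all placeholders (swap in tape once there is tape):

      ## every so often: a full snapshot (lock+copy, mysqlhotcopy, or a paused slave)
      ## in between: rotate and ship the binlogs, MySQL's version of redo logs
      master$       mysqladmin flush-logs      # close the current binlog, open a fresh one
      master$       rsync -a /var/lib/mysql/master-bin.[0-9]* backuphost:/backups/binlogs/
      master mysql> PURGE MASTER LOGS TO 'master-bin.042';   -- only once they're safely off-box
                                                             -- and no slave still needs them
      ## restore = reload the last snapshot, then replay everything since it was taken:
      backuphost$   mysqlbinlog master-bin.0* | mysql -h master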

  5. What’s wrong with MySQL replication, as far as you’re concerned?

    And what’s so hard/different about replicating InnoDB? I’ve replicated every combination of {MyISAM,InnoDB} to {MyISAM,InnoDB} without problems. (well, obviously can’t do InnoDB to MyISAM if preserving transactions…)

    How do you find MySQL randomly breaking? With what errors?

  6. mysqlhotcopy is locky, but anything that copies the full set of data out while maintaining something vaguely resembling integrity while still keeping the whole lot on disk has to be locky. It’s not sqldumpish, though — it’s copying the actual files out of the datadir (which is why it’s no good with InnoDB). It’s also thoroughly frightening perl (although it’s Tim Bunce’s, and later Jeremy Zawodny’s, thoroughly frightening perl), and doubly so once you make it work with rsync, but it works for taking snapshots.
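
    (For the record, the basic invocation is just something like the following, with the database name and target directory as placeholders; --allowold shoves an existing copy aside instead of refusing to run:

      mysqlhotcopy --user=root --password=XXXX --allowold mydb /var/snapshot

    The rsync trickery is where it stops being that tidy.)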

    I like the dummy machine idea, and we have such a machine already, but I can’t go introducing an extra site into the picture; the site with the dummy machine is the unreliable one that led the live machines to be colocated elsewhere (one with some yahoos in Fisher Plaza, incidentally).

    Mostly I was having a hard time convincing myself that these two well-documented and well-established features were both impractical and that I should stop paying attention to them. :-)

    Nonetheless I think I’ve managed to get rid of some messy manual steps in our process tonight. There was a great deal of wheel-reinventing going on here.

  7. It randomly broke on us for long enough that we stopped trusting built-in things and started reinventing wheels, which is what I’m trying to undo, because the breakage seems to have stopped since we moved from 3.x to 4. We’re running 4.0.14 across the board right now.

    We’ve tried a few different ways of monitoring slaves in the past. Our usage means that there is a guaranteed constant stream of updates, and that if there isn’t, the replication alert will be lost in all of the other significantly more important alerts we’re getting. :-) Right now we’re watching that Read_Master_Log_Pos and Relay_Log_Pos are incrementing or rolling over and that Slave_IO_Running and Slave_SQL_Running are “Yes”; in the past we’ve watched the questions counter in “mysqladmin stat” with some success, but wanted the finer granularity (particularly since some experimental maintenance tasks occasionally involve stopping SQL but not IO).
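
    In cron-job form the check amounts to something like this (a rough, untested sketch; the state file is made up, and anything it echoes goes to the usual alerting):

      #!/bin/sh
      ## run on each slave; anything printed here is a complaint
      STATUS=`mysql -e 'SHOW SLAVE STATUS\G'`
      echo "$STATUS" | grep -q 'Slave_IO_Running: Yes'  || echo "slave IO thread not running"
      echo "$STATUS" | grep -q 'Slave_SQL_Running: Yes' || echo "slave SQL thread not running"
      ## the positions should change between runs (rolling over to a new binlog changes them too)
      POS=`echo "$STATUS" | awk '/Read_Master_Log_Pos|Relay_Log_Pos/ {print $2}'`
      LAST=/var/tmp/last_slave_pos
      [ -f $LAST ] && [ "$POS" = "`cat $LAST`" ] && echo "slave positions not moving: $POS"
      echo "$POS" > $LAST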

    Incidentally — and mostly for my own reference since I’m about to go to bed and I need to remember this tomorrow — what I ended up with for bringing a broken slave back in sync was:

    1. slave (if necessary): SLAVE STOP
    2. copy the tables on the master and the slave each to a local staging area
    3. master: FLUSH TABLES WITH READ LOCK (downtime begins)
    4. rsync the master’s tables to its staging area again
    5. master: UNLOCK TABLES (downtime ends; should only be seconds)
    6. master: RESET MASTER
    7. rsync the master’s staging area over the slave’s
    8. slave: FLUSH TABLES WITH READ LOCK
    9. rsync from the slave’s staging area over the slave’s datadir
    10. slave: UNLOCK TABLES
    11. slave: RESET SLAVE
    12. slave: SLAVE START

    (Since all of our slaves are hot spares or backups, we can afford to RESET MASTER and bring them all back up together, instead of counting on starting replication at the right point. That doesn’t mean I won’t try to refine things further, just that that was the minimum requirement that I worked out right now.)
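
    Strung together, with placeholder hostnames, paths, and database name, and with each read lock taken in a mysql session that stays open across the rsync beside it (the lock vanishes when that connection closes), it comes out to roughly:

      ## 1-2: stop replication on the slave, then pre-seed a staging area on each
      ##      box while everything is still live, so the locked rsyncs only move deltas
      slave$        mysql -e "SLAVE STOP"
      master$       rsync -a /var/lib/mysql/mydb/ /var/staging/mydb/
      slave$        rsync -a /var/lib/mysql/mydb/ /var/staging/mydb/

      ## 3-6: on the master; downtime is only the second, mostly-no-op rsync
      master mysql> FLUSH TABLES WITH READ LOCK;
      master$       rsync -a /var/lib/mysql/mydb/ /var/staging/mydb/
      master mysql> UNLOCK TABLES;
      master mysql> RESET MASTER;

      ## 7: the only WAN hop, again mostly deltas, so compress it
      master$       rsync -az /var/staging/mydb/ slave:/var/staging/mydb/

      ## 8-12: on the slave
      slave mysql>  FLUSH TABLES WITH READ LOCK;
      slave$        rsync -a /var/staging/mydb/ /var/lib/mysql/mydb/
      slave mysql>  UNLOCK TABLES;
      slave mysql>  RESET SLAVE;
      slave mysql>  SLAVE START;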

  8. Reading this, I suddenly am reminded exactly *why* I’ve spent all day studying magnetic fields in preparation for an upcoming exam.

    I dance the happy dance of disregard! Wheeee!

    (I *really* should reactivate that old LJ account)

  9. I figured you’d show up eventually. :-) Have I mentioned that they’ve got me working on HP-UX and Win2k as well now? Everything’s going pretty well, though it’s odd to have Sun kit under my desk again.

    (And yes, you should, it’s a pain to have to unscreen your comments every time. :-) Or open a new one, since you don’t need invite codes anymore.)