sysadmin – rich text
Rich Lafferty's OLD blog
https://www.lafferty.ca
Thu, 18 Sep 2008 18:04:00 +0000

Fun with ANALYZE TABLE
https://www.lafferty.ca/2008/09/17/fun-with-analyze-table/
Wed, 17 Sep 2008 04:11:48 +0000

MySQL has been naughty for me lately.

First, I ran into a neat little issue on FreshBooks’ production servers last week involving the table cache and an O(n) algorithm for selecting a table to close. I wrote up a little explanation over on the FreshBooks blog that you might find interesting if you find any of this interesting. The short version is that if you’re going to be running with a full table cache and still opening tables regularly, you’ll be better off with a much smaller table cache, because finding the least-recently-used table to close is big-O of the size of the table cache. Smaller table cache = fewer tables to determine the LRU.
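To see why the size of the cache matters, here is a minimal Python sketch of a linear least-recently-used search. It only illustrates the O(n) behaviour described above and is not MySQL's actual table cache code; the `find_lru` function and the dictionary layout are assumptions made up for the example.

```python
def find_lru(cache):
    """Scan every entry to find the least-recently-used one.

    `cache` maps a table name to the logical time it was last used.
    Returns (victim_name, entries_examined).
    """
    if not cache:
        raise ValueError("cache is empty")
    victim = None
    examined = 0
    for name, last_used in cache.items():
        examined += 1
        if victim is None or last_used < cache[victim]:
            victim = name
    return victim, examined


# Hypothetical caches: the work done per eviction grows with the cache size.
small = {f"table_{i}": i for i in range(64)}
large = {f"table_{i}": i for i in range(4096)}
print(find_lru(small))
print(find_lru(large))
```

Every eviction examines every cached entry, so a cache that is always full and always churning pays that cost on each new table open.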

And then last night, out of the blue, a web forum about tinwhistles that I host hit a wall. About 8:30, my mostly-idle Linode went heavily IO-bound — as in one of the four CPUs spinning in diskwait all the time. What had originally been complex but fast (and common) queries were suddenly taking minutes and minutes to run: things like “get a list of topics in a forum”, and especially “get a list of posts for a forum’s RSS feed”.

There’s a lot of EXPLAIN output here, so I’d better put this behind a cut.

I took a look at the RSS feed query, which seemed the worst not only because it took a long time — 600+ seconds — but because it read-locked every important table in the database. Here’s the query, a big inner join:

SELECT t.topic_title, t.topic_last_post_id, t.forum_id,
    f.forum_name, p.post_time, pt.post_text, pt.bbcode_uid,
    u.username, u.user_id
FROM phpbb_topics t, phpbb_posts p, phpbb_posts_text pt,
    phpbb_users u, phpbb_forums f
WHERE t.topic_status != 1
  AND p.post_id = t.topic_last_post_id
  AND pt.post_id = p.post_id
  AND u.user_id = p.poster_id
  AND t.forum_id = f.forum_id
ORDER BY t.topic_last_post_id DESC
LIMIT 0, 15;

Here’s the mk-visual-explain output. I’ve replaced the table aliases with readable things:

Filesort
+- TEMPORARY
   table          temporary(forums,topics,posttext,posts,users)
   +- JOIN
      +- Bookmark lookup
      |  +- Table
      |  |  table          users
      |  |  possible_keys  PRIMARY
      |  +- Unique index lookup
      |     key            users->PRIMARY
      |     possible_keys  PRIMARY
      |     key_len        3
      |     ref            chiffbb.posts.poster_id
      |     rows           1
      +- JOIN
         +- Bookmark lookup
         |  +- Table
         |  |  table          posts
         |  |  possible_keys  PRIMARY,poster_id
         |  +- Unique index lookup
         |     key            posts->PRIMARY
         |     possible_keys  PRIMARY,poster_id
         |     key_len        3
         |     ref            chiffbb.topics.topic_last_post_id
         |     rows           1
         +- JOIN
            +- Bookmark lookup
            |  +- Table
            |  |  table          posttext
            |  |  possible_keys  PRIMARY
            |  +- Unique index lookup
            |     key            posttext->PRIMARY
            |     possible_keys  PRIMARY
            |     key_len        3
            |     ref            chiffbb.topics.topic_last_post_id
            |     rows           1
            +- JOIN
               +- Filter with WHERE
               |  +- Bookmark lookup
               |     +- Table
               |     |  table          topics
               |     |  possible_keys  forum_id,topic_status,topic_last_post_id
               |     +- Index lookup
               |        key            topics->forum_id
               |        possible_keys  forum_id,topic_status,topic_last_post_id
               |        key_len        2
               |        ref            chiffbb.forums.forum_id
               |        rows           2579
               +- Table scan
                  rows           23
                  +- Table
                     table          forums
                     possible_keys  PRIMARY

See that temporary table at the top that gets used in a filesort? Well…

The whole thing was multiple joins which were then ORDERed and LIMITed. So that meant that it had to find all posts to the forum, ever, and shove them in a temporary table, sort that, and take the 15 most recent posts.

“All posts to the forum, ever” is about 500MB of data. That made the temporary table big enough to go to disk. So every time this query ran and couldn’t be answered from the query cache, it had to write that 500MB file. And the cached query was invalidated whenever someone posted to the forum, which is pretty often.

The problem in this case wasn’t (entirely) the SQL. MySQL was optimizing the query poorly because the key distribution statistics were off. An ANALYZE TABLE on the affected tables fixed that, and gave us:

JOIN
+- Bookmark lookup
|  +- Table
|  |  table          forums
|  |  possible_keys  PRIMARY
|  +- Unique index lookup
|     key            forums->PRIMARY
|     possible_keys  PRIMARY
|     key_len        2
|     ref            chiffbb.topics.forum_id
|     rows           1
+- JOIN
   +- Bookmark lookup
   |  +- Table
   |  |  table          users
   |  |  possible_keys  PRIMARY
   |  +- Unique index lookup
   |     key            users->PRIMARY
   |     possible_keys  PRIMARY
   |     key_len        3
   |     ref            chiffbb.posts.poster_id
   |     rows           1
   +- JOIN
      +- Bookmark lookup
      |  +- Table
      |  |  table          posttext
      |  |  possible_keys  PRIMARY
      |  +- Unique index lookup
      |     key            posttext->PRIMARY
      |     possible_keys  PRIMARY
      |     key_len        3
      |     ref            chiffbb.posttext.post_id
      |     rows           1
      +- JOIN
         +- Bookmark lookup
         |  +- Table
         |  |  table          posts
         |  |  possible_keys  PRIMARY,poster_id
         |  +- Unique index lookup
         |     key            posts->PRIMARY
         |     possible_keys  PRIMARY,poster_id
         |     key_len        3
         |     ref            chiffbb.topics.topic_last_post_id
         |     rows           1
         +- Filesort
            +- Filter with WHERE
               +- Bookmark lookup
                  +- Table
                  |  table          topics
                  |  possible_keys  forum_id,topic_status,topic_last_post_id
                  +- Index range scan
                     key            topics->topic_status
                     possible_keys  forum_id,topic_status,topic_last_post_id
                     key_len        1
                     rows           57912

There’s still a filesort, but it’s now a filesort of a single 57k-row table that’s already been filtered. That table is about 5MB, and fits in tmp_table_size, so doesn’t go to disk. The joins all stack, and the ORDER BY just follows that one-table filesort. The query takes about 0.15s now, or about 4000x as fast.
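In MySQL the fix is just `ANALYZE TABLE` followed by the table name. To show what "key distribution statistics" look like in practice, here is a small sketch that swaps in SQLite (via Python's standard `sqlite3` module) because it runs anywhere without a server. SQLite's `ANALYZE` and its `sqlite_stat1` table are not the same mechanism as MySQL's, and the table and index names below are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE topics (topic_id INTEGER PRIMARY KEY, forum_id INTEGER)")
conn.execute("CREATE INDEX idx_forum ON topics (forum_id)")

# 100 hypothetical topics spread evenly over 10 forums.
conn.executemany(
    "INSERT INTO topics (topic_id, forum_id) VALUES (?, ?)",
    [(i, i % 10) for i in range(1, 101)],
)

# Gather index statistics; SQLite stores them in the sqlite_stat1 table.
conn.execute("ANALYZE")
stats = conn.execute(
    "SELECT tbl, idx, stat FROM sqlite_stat1 WHERE idx = 'idx_forum'"
).fetchall()
print(stats)
```

The `stat` column holds the row count for the index followed by an estimate of how many rows share each key value. Numbers like these are what a planner uses to pick a join order, and when they are stale it can pick a bad one.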

Incidentally, it can still be improved: that “filter with WHERE” is because of the “WHERE t.topic_status != 1” in the query, which means “where the topic is not locked”. The idea was that locked topics aren’t going to appear in the last 15 posts anyhow, so you may as well exclude them. But if the sort already keeps them out of the result, and you’re sorting anyhow, the filter doesn’t buy anything unless there are a LOT of locked topics. Taking out that restriction gets us:

JOIN
+- Bookmark lookup
|  +- Table
|  |  table          users
|  |  possible_keys  PRIMARY
|  +- Unique index lookup
|     key            users->PRIMARY
|     possible_keys  PRIMARY
|     key_len        3
|     ref            chiffbb.posts.poster_id
|     rows           1
+- JOIN
   +- Bookmark lookup
   |  +- Table
   |  |  table          posts
   |  |  possible_keys  PRIMARY,poster_id
   |  +- Unique index lookup
   |     key            posts->PRIMARY
   |     possible_keys  PRIMARY,poster_id
   |     key_len        3
   |     ref            chiffbb.topics.topic_last_post_id
   |     rows           1
   +- JOIN
      +- Bookmark lookup
      |  +- Table
      |  |  table          posttext
      |  |  possible_keys  PRIMARY
      |  +- Unique index lookup
      |     key            posttext->PRIMARY
      |     possible_keys  PRIMARY
      |     key_len        3
      |     ref            chiffbb.topics.topic_last_post_id
      |     rows           1
      +- JOIN
         +- Bookmark lookup
         |  +- Table
         |  |  table          forums
         |  |  possible_keys  PRIMARY
         |  +- Unique index lookup
         |     key            forums->PRIMARY
         |     possible_keys  PRIMARY
         |     key_len        2
         |     ref            chiffbb.topics.forum_id
         |     rows           1
         +- Bookmark lookup
            +- Table
            |  table          topics
            |  possible_keys  forum_id,topic_last_post_id
            +- Index scan
               key            topics->topic_last_post_id
               possible_keys  forum_id,topic_last_post_id
               key_len        3
               rows           59900

And with that there isn’t even a filesort and the query finishes in <0.01 seconds, 60000x as fast as the original problem and 15x as fast as the post-ANALYZE optimization. Nice.
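The gap between the filesort plans and the final index scan can be sketched in Python. This is a toy model under stated assumptions, not MySQL's executor: the rows are generated, the function names are hypothetical, and the "index" is simply a list that was put in key order ahead of time, standing in for an index that is maintained as rows are written.

```python
import random


def make_topics(n, seed=0):
    """Build hypothetical topic rows with unique, shuffled last-post ids."""
    post_ids = list(range(1, n + 1))
    random.Random(seed).shuffle(post_ids)
    return [
        {"topic_id": topic_id, "topic_last_post_id": post_id}
        for topic_id, post_id in enumerate(post_ids, start=1)
    ]


def top_by_full_sort(rows, limit=15):
    """Sort every row, then keep `limit`: the filesort-then-LIMIT shape."""
    ordered = sorted(rows, key=lambda r: r["topic_last_post_id"], reverse=True)
    return ordered[:limit], len(rows)


def top_by_index_walk(index_desc, limit=15):
    """Read an already-ordered index from the top and stop after `limit`."""
    picked = []
    for row in index_desc:
        if len(picked) == limit:
            break
        picked.append(row)
    return picked, len(picked)


topics = make_topics(59900)
# Stand-in for the topic_last_post_id index, already in descending key order.
index_desc = sorted(topics, key=lambda r: r["topic_last_post_id"], reverse=True)

sorted_rows, touched_sort = top_by_full_sort(topics)
walked_rows, touched_walk = top_by_index_walk(index_desc)
print(touched_sort, touched_walk)
```

Both functions return the same 15 rows; the difference is that the sort has to touch every row while the index walk stops as soon as the LIMIT is satisfied.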

FreshBooks is hiring a sysadmin!
https://www.lafferty.ca/2008/07/07/freshbooks-is-hiring-a-sysadmin/
Mon, 07 Jul 2008 20:22:16 +0000

I’m not sure that I’ve got a lot of Toronto-local intermediate-level sysadmins reading here, but just in case, I’ve just posted a job posting for an Intermediate Linux System Administrator (SAGE level III) to our careers site.

The details are all at the link, but basically we’re at the point where there’s enough to do that stuff needs doing in parallel. It’ll be a two-sysadmin shop after this, so there’s lots to do from PCI compliance and new architectures down to maintaining desktops in a casual but busy startup-ish environment with a lot of fun people.

If you’re interested, or if you know someone who might be, drop us a line. Update: the position’s been filled!

Exploiting NIC firmware
https://www.lafferty.ca/2008/05/16/exploiting-nic-firmware/
Fri, 16 May 2008 18:16:53 +0000

From Ben Laurie: Bypass the firewall by bypassing everything but the PCI bus.

Dear lazyweb: Hyperic, Zenoss?
https://www.lafferty.ca/2008/05/05/hyperic-zenoss/
Mon, 05 May 2008 22:06:02 +0000

Sysadmins on the lazyweb: I’ve used Nagios for years, accompanied by either a homebrew trending/graphing package or Munin. Recently I’ve had a few people draw my attention to Hyperic, and from there I’ve been looking at Zenoss Core as well.

If any of you have experience with Hyperic or Zenoss, and especially if you’ve left Nagios for either, I’d love to hear what you think, whether it be a sales pitch or a warning.

https://www.lafferty.ca/2008/05/01/heart-linode/
Thu, 01 May 2008 19:58:37 +0000

lish is the ssh-based lights-out admin console for a Linode virtual server.

[rich@dallas64 lish]# cake
Devil's Food Cake - tasaro's office desktop
INGREDIENTS:
* 3/4 cup unsweetened cocoa                   * 1 1/4 tsp baking soda
* 1 1/3 cups granulated sugar                 * 1 teaspoon salt
* 1 1/4 cups milk, scalded                    * 2/3 cup shortening
* 1 1/4 teaspoons vanilla extract             * 3 eggs
* 2 cups cake flour, sifted or stirred before measuring

DIRECTIONS
  Grease two 9-inch layer cake pans and line bottoms with wax paper.
Grease wax paper. Sift the cocoa with 1/3 cup sugar; pour into the
milk gradually; stir until well blended. Set aside to cool. Sift
together flour, remaining 1 cup sugar, soda, and salt. Add
shortening and half of the cooled cocoa and milk mixture. Beat at
medium speed of an electric hand-held mixer. Add eggs, vanilla, and
remaining cocoa and milk mixture. continue beating for about 2
minutes, scraping bowl with a spatula occasionally. Pour into
prepared pans. Bake at 350° for 25 to 30 minutes. Cool in the pans
for 5 minutes; turn out on racks and peel off paper. Cool and frost
devil's food cake as desired.

mmm, cake
[rich@dallas64 lish]#

(It probably doesn’t hurt that Linode is run by Chris Aker aka caker…)

Printer fun
https://www.lafferty.ca/2008/04/18/printer-fun/
Fri, 18 Apr 2008 18:18:23 +0000

I spent much of the afternoon yesterday on the phone with Dell, debugging a confused printer.

We moved the printer across the room, and following that it wouldn’t print; it’d just sit there at “Printing…”, and the client print progress thing would stay at 0%… until you disconnected the network cable. Then it’d print whatever you’d sent. Weirder still, the same thing would happen with internal print jobs. Print a configuration page? “Printing…” until you disconnect the network cable.

It was still under warranty, so I gave Dell a call. He walked through some obvious things, and then had me flash the firmware on the printer — oops, wait, that’s over the network. Ok, bring the printer over to my desk and… you need Windows to flash it? Ok, over to Levi’s desk, and flash it. No problem. Plug it back in; no luck.

So the Dell guy gives up; they’re just going to send us another printer. Great! It took a while to figure out whether or not it was in stock, but while I was waiting, a page came out.

Wait, what?

And then, five minutes later, another page. Now, “takes five minutes to print a page” is a very different problem than before! But at this point the replacement printer was being dispatched and the Dell guy didn’t want to do more troubleshooting. But once I got off the phone, I did, and I’m glad.

The first thing I noticed is that the switch lights were blinking like crazy. I tracked that back through two more switches to our Samba server. Aha! Run tcpdump there, and:

14:06:13.459933 IP 192.168.1.151.137 > 192.168.1.150.137: NBT UDP PACKET(137):
REGISTRATION; REQUEST; UNICAST
14:06:13.459933 IP 192.168.1.150.137 > 192.168.1.151.137: NBT UDP PACKET(137):
REGISTRATION; NEGATIVE; RESPONSE; UNICAST
14:06:13.463931 IP 192.168.1.151.137 > 192.168.1.150.137: NBT UDP PACKET(137):
REGISTRATION; REQUEST; UNICAST
14:06:13.463931 IP 192.168.1.150.137 > 192.168.1.151.137: NBT UDP PACKET(137):
REGISTRATION; NEGATIVE; RESPONSE; UNICAST

And as you can see from the timestamps there, both ends were talking as fast as they could: the printer sending NBT registration requests, and the Samba server sending errors back, over and over, hundreds of times per second. Tell the printer to forget about its Samba server, and voilà, printing’s back to normal.
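For a rough number, the request rate can be worked out from the capture itself. The sketch below is plain Python over the four packets quoted above; the helper names are hypothetical, and two requests are far too small a sample for anything but a ballpark figure.

```python
from datetime import datetime

CAPTURE = """\
14:06:13.459933 IP 192.168.1.151.137 > 192.168.1.150.137: NBT UDP PACKET(137):
REGISTRATION; REQUEST; UNICAST
14:06:13.459933 IP 192.168.1.150.137 > 192.168.1.151.137: NBT UDP PACKET(137):
REGISTRATION; NEGATIVE; RESPONSE; UNICAST
14:06:13.463931 IP 192.168.1.151.137 > 192.168.1.150.137: NBT UDP PACKET(137):
REGISTRATION; REQUEST; UNICAST
14:06:13.463931 IP 192.168.1.150.137 > 192.168.1.151.137: NBT UDP PACKET(137):
REGISTRATION; NEGATIVE; RESPONSE; UNICAST
"""


def request_times(capture, source="192.168.1.151.137"):
    """Collect timestamps of packets sent by `source` (the printer here)."""
    times = []
    for line in capture.splitlines():
        fields = line.split()
        # Packet header lines look like: <time> IP <src> > <dst>: ...
        if len(fields) >= 4 and fields[1] == "IP" and fields[2] == source:
            times.append(datetime.strptime(fields[0], "%H:%M:%S.%f"))
    return times


def requests_per_second(times):
    """Average rate over the span between the first and last request."""
    if len(times) < 2:
        raise ValueError("need at least two timestamps")
    span = (times[-1] - times[0]).total_seconds()
    if span <= 0:
        raise ValueError("timestamps do not span any time")
    return (len(times) - 1) / span


times = request_times(CAPTURE)
print(round(requests_per_second(times)))
```

The two requests are just under 4 ms apart, which works out to roughly 250 request/response pairs per second.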

So what happened? As best as I can tell, one of two things: Either moving the printer made it get a DHCP configuration for the first time in over a month, since we rolled out a new DHCP server in the meantime; or it’s been slow all along, and moving it to the same switch as the Samba server, instead of two switches away, made it marginally busier, enough for it to not print at all instead of just printing slowly.

Still, I could think of better things to have spent an afternoon on.

No more Unix mail at Dreamhost
https://www.lafferty.ca/2008/04/09/no-more-unix-mail-at-dreamhost/
Wed, 09 Apr 2008 13:54:51 +0000

I left DreamHost just in time:

We’re no longer allowing (new) FTP/SHELL users to have an email address associated with them.
[…]
Fortunately, this change should be more or less invisible to everybody! The only thing lost is the ability to see and manipulate your mail files via FTP/Shell… (and even that is only for new users from now on). Whoop-dee-do, I say!

Right, why would anyone want to use their own SpamAssassin, procmail, or a Unix mail client? I never had a problem with overselling at Dreamhost — in fact, I’d go so far as to say that I’m happy to take advantage of it — but I don’t think that’s their problem. I think they’ve just let themselves grow until they’re deep over their heads.

(And yes, that doesn’t affect existing shell accounts there, but I imagine that’s just a matter of time, because it’s not like running two parallel mail architectures is going to help them much.)

I’m on Linode now!
https://www.lafferty.ca/2008/04/07/on-linode-now/
Tue, 08 Apr 2008 02:45:52 +0000

After my post about my Dreamhost experiences, I finally decided that enough was enough and signed up for a Linode. I should’ve done this ages ago.

For $20/mo, I get a virtual server (using Xen, which is conceptually like VMware if you’ve heard of one but not the other) with 360MB of RAM, 10GB of disk, 200GB of monthly bandwidth, a true remote console, and full root access. There’s no CPU or I/O limiter; you’re expected to play nicely but you can burst to the capacity of the hardware (which in my case is a dual quad-core Xeon shared with 39 other Linodes; the bigger Linodes have fewer neighbours). You choose your data centre from three options, too — I’m in Dallas, 2.6 ms from FreshBooks’ servers. And they don’t oversell: there’s often a waiting list for a particular size virtual server, because if the current servers are full they just don’t sell any until they get more servers.

When I moved to Dreamhost, I’d been a sysadmin on a communal coloed box hosted by a friend, and that eventually turned into a drag due to unreliable hardware and unreliable users. I’d decided that I sysadminned enough during the day and that someone else could be my sysadmin. But I was never really happy with that; the web side of things was okaaaay, but not having control over the mail server was a pain, and having hardly any visibility of what MySQL was doing was annoying.

That’s solved now! I’ve moved all of our sites except the whistle forum to the Linode, and my and Candice’s mail is there too. It’s crazy fast compared to Dreamhost (especially IMAP), and I’ve got the flexibility to play with things; one weekend I installed four or five alternative webservers and loadbalancers and switched between them, just to get used to their quirks before trying them out at the office, and then back to Apache again.

But what really won me over at Linode was service. It’s a small shop (there can’t be more than five or six employees), support tickets are addressed in minutes instead of days, the userbase is friendly to each other on the forums, and a bunch of senior staff including the owner all hang out on the support IRC channel. I ran into a weird issue once and was sharing my Munin graphs with the owner minutes later. Even though we never tracked down exactly what happened, I’m completely confident in these guys.

They offer virtual servers from my little $20/mo one up to an $80 1.4GB-40GB-800GB/mo plan. They’ve got no referral programs or discount codes; just great performance and great service, and are a great place to dip your toes into system administration, finally get that personal colo box, or even set up a remote monitoring box for critical work-related services.

Ages ago I was doubtful about virtual servers, but that was when $20 only got you 60MB of RAM; now that you can run pretty much anything you’d want to, it’s working out great.

Dreamhost: a comedy of errors
https://www.lafferty.ca/2008/03/27/dreamhost-a-comedy-of-errors/
Thu, 27 Mar 2008 15:35:50 +0000

You may have noticed that this place was hard to get to for the last week or so.

I’ve hosted this blog and a bunch of other websites on Dreamhost since 2004, and I’ve referred enough people to them that my hosting there has been free for years. But most of those four years have been spent just below the “I need to do something about this” level of dissatisfaction.

As of this last week, though (which featured a 12h planned outage followed by the rest of the week trying to recover from NFS problems which left sites unresponsive or just plain missing), I’ve had enough, and I’ve bought a virtual private server at Linode instead. I’ll post more about Linode later on, but last night Dreamhost resolved their NFS issues and I had a brief moment of reconsideration. After all, it’s free.

So I brought up my support history and read through it, and once again I’ve convinced myself it’s time to move critical services away from there. But the more I read it, the more I realized I should share the highlights of my experience. Like last time with IStop, my awful Ottawa ISP, I’m left wondering why I stuck around so long!

The details are after the cut.


May 21, 2004: Apache configuration prevents files named “README.txt” from showing in directory listings. Support writes,

Can’t you just rename it to something that will not be filtered like that?

May 23, 2004: I complain that the installed SpamAssassin is ancient. Support writes:

[W]e may upgrade at some point, but you’d really want to install your own version if you want to stay current at all. […] I would definitely not suggest using 2.20 for anything at this point.

Sep 15, 2004: Payment fails:

Current Balance: -$9.94
Amount Due: $9.94
Due Date: 2004-09-15

Failure! Please correct the errors below.
THERE IS A $9.95 MINIMUM FOR PAYMENTS

Feb 5, 2005: Fileserver damage. I get five copies of a “Your data has been restored!” form mail. I point this out to support, since I assumed at that point I’d keep getting copies of the form mail forever. They reply:

One of our server clusters was having fileserver issues on Friday. The account you’re writing in from was not affected.

I point out that no, it was affected, the restore was indeed successful, and I want to stop receiving mail about it. They reply apologizing for the multiple messages but continue to insist that my data (which was missing 24h previously) was not affected.

May 8, 2005: I get mail at 2AM:

I had to disable your database chiffbb. It was using enough of the CPUs on the server to justify having its own server. If you want to continue running your bulletin board, you should consider a dedicated server.

I point out that it has been running at a constant load for months, except that recently they replaced the database server:

So, now I have no access to the data, no more access to the conuery statistics to see if something went wrong recently or if it’s been building up slowly, no idea what the problem queries were, no way to make sure the indices I needed were there — nothing at all to work with aside from “enough of the CPUs”. Nothing has changed on the forums in the past year, so I’m a bit confused as to what might have happened overnight to get your attention.

Please let me know what I’m supposed to do at this point to figure out if things can simply be scaled back, given that “blindly trust you that I have to give you more money to get my data back” is an unacceptable option.

(Incidentally, shutting things down with no grace period at 2 AM on Sunday — and Mother’s Day no less — doesn’t really seem to match the whole “we’ll be nice about it” from your conueries knowledge base page. I’d recommend either going back to a hard quota which people can compare to their usage, or giving a grace period for this sort of thing; otherwise it is impossible for users to actually manage their usage.)

They eventually re-enable the service and tell me their MySQL guy will get in touch with me to figure out what’s going on. That never happens and I don’t hear anything about the forums again for a while.

Nov 15, 2005: One of four IMAP servers isn’t authenticating. Mail problems are the new black.

Jan 16, 2006: Someone’s added “dnsalias.com” and a dozen other dyndns.org second-level domains to their account, making Dreamhost’s nameservers (which are both their customer authoritative nameservers and their resolvers) refuse to believe that anyone else’s dyndns hostnames exist. They remove “dnsalias.com”.

Apr 26, 2006: Home directories under /home disappear on the mail server cluster, although the actual mountpoint at some long undocumented path still works.

Jun 27, 2006: “crontab -e” reports “Permission denied”.

Jun 30, 2006: Home directories under /home disappear on the mail cluster again.

Aug 3, 2006: Home directories under /home disappear on the mail cluster AGAIN.

Aug 5, 2006: Internal reverse DNS fails, and suddenly nothing can authenticate to MySQL, which has grants to ‘user’@’hostname’.

Aug 11, 2006: Home directories under /home disappear on the mail cluster. Again.

Aug 16, 2006: Home directories… yeah. Support reply begins:

I was actually going to email you to let you know that we had a problems with those machines, however, I couldn’t remember your ID.

Aug 20, 2006: The mail problem from the 16th is resolved four days later.

Oct 2, 2006: Remember back in January where someone claimed “dnsalias.com”? It happened again. They ask me to provide a full list of “the domains I own”, even though I explained what dyndns.org was in my request. At least this time they add all of dyndns.org’s domains to their list.

Feb 24, 2007: Home directories. Mail. Yep.

Mar 16, 2007: Remember back in 2005 when they shut down the whistle forums because of load? Guess what! Again I point out that the load has been constant for months and that the only change was a new server on their end. They’re less angry this time, at least, and they again reconsider:

Actually, I’m still digging into the load on this server, and the more I dig, the more I see that the throttle on your site is pointless :) I’m very sorry about that, I actually went ahead and removed the throttle as it wasn’t bringing the load down at all. I’m still looking into the load and will let you know when I pin it down. Don’t worry though, you were a false alarm, at the time, you were the busiest site, and you were the best candidate, however, the wrong one. I’m very sorry about that!

This time I convince them to put a note on my account which basically says “This is the first site you’ll notice when this cluster gets slow, but it’s not the root cause”.

May 8, 2007: Dreamhost gets listed in the CBL.

May 18, 2007: Mail server won’t accept mail. “450 Server configuration problem.”

Jun 8, 2007: Mail bounces with “unknown user”.

Jun 9, 2007: Mail server home directories again. Not the usual root cause, though:

We had an issue with our mail updating system where the server responsible for password updates cut off the password file short.

Jul 18, 2007: They accidentally disable relaying from localhost on the webserver’s mail server. Suddenly no web apps that use SMTP can send mail. Reply in part:

Looks like this was an accidental change made to the mail config when one of the admins altered something else.

Aug 11, 2007: Mail server home directories.

Sep 19, 2007: Mail spool fills up. “452 Insufficient system storage”, complains Postfix. Reply in part:

Sorry about that! Our systems normally notify us before there are any problems, but it’s been rather busy lately with larger issues.

Oct 25, 2007: pop3 logins failing. I love this part of the (auto)reply:

Note: This was an announcement due to a large support incident. Sorry if you did not get callback support.

Jan 18, 2008: Dreamhost accidentally bills customers for their entire next year of hosting, a $7.5-million error that ends up costing DreamHost over $500,000.

Mar 21, 2008: The outage that prompted this post and my move: growth issues necessitate moving my webserver’s cluster to a new data centre, which involves a 12-hour scheduled outage. Were that not enough, following the move, load hovers around 10, and disk operations take seconds to complete. The load/IO problem isn’t resolved until Mar 26.

Dreamhost will still be handy for the whistle forums and anything I need to host a lot of noncritical but big data for, but I’m looking forward to a change.

Fun with DHCP
https://www.lafferty.ca/2008/03/19/fun-with-dhcp/
Wed, 19 Mar 2008 19:50:40 +0000

I rolled out a new firewall/DNS server/DHCP server at FreshBooks today. Went well except for one problem: occasionally people would lose DNS resolution. Well, that’s not good.

Checking out their machines showed that their DNS server addresses were being changed to an address on the wrong subnet, and their domain being changed to “mshome.net”. That last part’s a red flag: the thing that does that is Windows’ Internet Connection Sharing, which means someone had that enabled on an interface and we basically had a rogue DHCP server.

Rogue DHCP servers are a pain to track down because without a monitoring port on the switch, all you have to go by is broadcast traffic, and then all you get is the address the DHCP server thinks it’s at — which, we know, is on the wrong subnet anyhow — and its MAC address. And we’re a small shop but we still don’t have a handy list of MAC addresses lying around. I did know that the MAC address’s vendor ID was Dell.
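The "handy list of MAC addresses" is the part that is easy to automate. Below is a minimal Python sketch that matches a sniffed MAC address against an inventory. The inventory, host names and addresses are all hypothetical (locally administered `02:` addresses, not real vendor prefixes). Note that the inventory has to include disabled interfaces, which is what made this one hard to find.

```python
HEX_DIGITS = "0123456789abcdef"


def normalize_mac(mac):
    """Return a MAC address as lowercase colon-separated octets."""
    digits = "".join(ch for ch in mac.lower() if ch in HEX_DIGITS)
    if len(digits) != 12:
        raise ValueError(f"not a MAC address: {mac!r}")
    return ":".join(digits[i:i + 2] for i in range(0, 12, 2))


def vendor_prefix(mac):
    """First three octets, which identify the vendor (the OUI)."""
    return normalize_mac(mac)[:8]


def find_owner(rogue_mac, inventory):
    """List every (host, interface) whose MAC matches the rogue address."""
    target = normalize_mac(rogue_mac)
    return [
        (host, iface)
        for host, ifaces in inventory.items()
        for iface, mac in ifaces.items()
        if normalize_mac(mac) == target
    ]


# Hypothetical inventory: every interface on every machine, enabled or not.
INVENTORY = {
    "laptop-a": {"eth0": "02:00:00:AA:BB:01", "wlan0": "02:00:00:AA:BB:02"},
    "laptop-b": {
        "eth0": "02:00:00:AA:BB:04",
        "wwan0 (disabled)": "02-00-00-aa-bb-03",
    },
}

print(find_owner("0200.00aa.bb03", INVENTORY))
```

Normalizing first means the lookup works regardless of whether a tool prints the address with colons, dashes or dots.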

So the first thing I did when I found the problem was to check the MAC addresses of all of the wired and wireless interfaces of the Dell computers in the office, and none of them matched! I puzzled over this for a while, had people double-check, and eventually something clicked and Saul remembered that Sunir had enabled ICS during their road trip.

I took a second look at Saul’s laptop, and there was the MAC address — on a disabled wireless broadband interface. Turns out that if you have ICS on, the DHCP server keeps running even when the shared network interface is down. Disable it, problem went away.

But the strange part was that Saul’s been back for a week and the problem just came up today.

I scratched my head about that for a bit and then it hit me: before today, the switch in the wiring closet was in the Linksys router that also served DHCP:

[client]----[switch + dhcp server]----[saul's PC]

After today, Saul’s network segment and the new DHCP server were both connected to a separate switch:

[client]-----------[switch]-----------[saul's PC]
                       |
                       |
                 [dhcp server]

DHCP is designed to handle multiple (cooperating) DHCP servers on a segment; when a client sends a request, any DHCP servers can respond, and the client chooses one of the responses and informs the DHCP server that sent it that it will use that one. The usual client implementation is to accept the first response.

So before today, a client on one segment would make a DHCP request, but the legitimate DHCP server (at the switch) would be located one Ethernet segment closer to the client than the rogue DHCP server, so it would always win. As of today, the legitimate DHCP server was now the same distance from the client as the rogue one, so part of the time it’d lose, which is exactly what was happening — not every DHCP lease was broken, just the occasional one.
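The race can be modelled with a short simulation. This is a sketch under loud assumptions, not a measurement: the delay units are arbitrary, the jitter is uniform, and the "client" simply takes whichever offer arrives first, as described above. The function names and numbers are made up for the example.

```python
import random


def first_offer(base_delays, rng, jitter=1.0):
    """Return the name of the server whose offer reaches the client first.

    `base_delays` maps a server name to its fixed path delay; each offer
    also gets a random processing delay between 0 and `jitter`.
    """
    arrivals = {
        name: base + rng.uniform(0, jitter)
        for name, base in base_delays.items()
    }
    return min(arrivals, key=arrivals.get)


def rogue_win_rate(base_delays, trials=10000, seed=0):
    """Fraction of lease requests answered first by the rogue server."""
    rng = random.Random(seed)
    wins = sum(first_offer(base_delays, rng) == "rogue" for _ in range(trials))
    return wins / trials


# Before: the legitimate server sits in the switch, one segment closer.
before = rogue_win_rate({"legit": 0.0, "rogue": 2.0})
# After: both servers hang off the same switch, the same distance away.
after = rogue_win_rate({"legit": 1.0, "rogue": 1.0})
print(before, after)
```

With a head start larger than the jitter the rogue server never wins; with equal distances it wins roughly half the time in this model, which matches the symptom of only the occasional lease being broken.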

Sometimes it’s easy to forget that actual electrons need to move around for this stuff to work — which in turn reminded me of Trey Harris’s 500-mile email.
