Fun with DHCP


I rolled out a new firewall/DNS server/DHCP server at FreshBooks today. Went well except for one problem: occasionally people would lose DNS resolution. Well, that’s not good.

Checking out their machines showed that their DNS server addresses were being changed to an address on the wrong subnet, and their domain being changed to “mshome.net”. That last part’s a red flag: the thing that does that is Windows’ Internet Connection Sharing, which means someone had that enabled on an interface and we basically had a rogue DHCP server.

Rogue DHCP servers are a pain to track down because without a monitoring port on the switch, all you have to go by is broadcast traffic, and then all you get is the address the DHCP server thinks it’s at — which, we know, is on the wrong subnet anyhow — and its MAC address. And we’re a small shop but we still don’t have a handy list of MAC addresses lying around. I did know that the MAC address’s vendor ID was Dell.
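The vendor ID mentioned above is the first three octets of the MAC address. Here is a minimal sketch of pulling that prefix out and looking it up. The prefix table is entirely hypothetical (`AABBCC` and `DDEEFF` are placeholders, not real vendor assignments); a real lookup would use the published IEEE registry.

```python
HEX_DIGITS = set("0123456789ABCDEF")

# Hypothetical prefix table for illustration only; these are NOT real
# vendor assignments. Substitute data from the IEEE registry.
VENDORS = {
    "AABBCC": "ExampleVendor A",
    "DDEEFF": "ExampleVendor B",
}


def oui(mac: str) -> str:
    """Return the first three octets of a MAC address as six uppercase hex digits."""
    # Drop separators such as ':', '-' and '.', whatever style the MAC uses.
    digits = "".join(ch for ch in mac if ch.isalnum()).upper()
    if len(digits) != 12 or not set(digits) <= HEX_DIGITS:
        raise ValueError(f"not a MAC address: {mac!r}")
    return digits[:6]


def vendor_of(mac: str, table=VENDORS) -> str:
    """Look up the vendor for a MAC address, or 'unknown' if the prefix is not listed."""
    return table.get(oui(mac), "unknown")


if __name__ == "__main__":
    print(vendor_of("aa:bb:cc:01:02:03"))
    print(vendor_of("00-11-22-33-44-55"))
```

Knowing the vendor only narrows the search to a set of machines; you still have to check every interface on each of them.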

So the first thing I did when I found the problem was to check the MAC addresses of all of the wired and wireless interfaces of the Dell computers in the office, and none of them matched! I puzzled over this for a while, had people double-check, and eventually something clicked and Saul remembered that Sunir had enabled ICS during their road trip.

I took a second look at Saul’s laptop, and there was the MAC address, on a disabled wireless broadband interface. Turns out that if you have ICS on, the DHCP server keeps running even when the shared network interface is down. I disabled ICS, and the problem went away.

But the strange part was that Saul’s been back for a week and the problem just came up today.

I scratched my head about that for a bit and then it hit me: before today, the switch in the wiring closet was in the Linksys router that also served DHCP:

[client]----[switch + dhcp server]----[saul's PC]

After today, Saul’s network segment and the new DHCP server were both connected to a separate switch:

[client]-----------[switch]-----------[saul's PC]
                       |
                       |
                 [dhcp server]

DHCP is designed to handle multiple (cooperating) DHCP servers on a segment: when a client broadcasts a request, any DHCP server can respond with an offer, and the client chooses one of the offers and tells the server that sent it that it will use that one. The usual client implementation is to accept the first offer that arrives.
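As a rough sketch of that "first response wins" behaviour, here is the selection step modelled in Python. This is a toy model, not any real client's code: the `Offer` fields, the addresses, and the arrival times are all made up for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Offer:
    server_id: str     # address the server reports for itself
    offered_ip: str    # address offered to the client
    dns_server: str    # DNS server handed out with the lease
    arrival_ms: float  # when the offer reached the client (hypothetical)


def choose_offer(offers: Iterable[Offer]) -> Optional[Offer]:
    """Pick the offer that arrived first, as a simple client would."""
    offers = list(offers)
    if not offers:
        return None
    return min(offers, key=lambda o: o.arrival_ms)


if __name__ == "__main__":
    # Hypothetical addresses: a legitimate server and a rogue one that
    # hands out settings for a different subnet.
    legit = Offer("10.0.0.1", "10.0.0.50", "10.0.0.1", arrival_ms=2.4)
    rogue = Offer("192.168.0.1", "192.168.0.23", "192.168.0.1", arrival_ms=2.1)
    chosen = choose_offer([legit, rogue])
    print("client accepts offer from", chosen.server_id)
```

Nothing in this selection step checks whether an offer makes sense for the network, which is why a rogue server only has to be fast, not correct.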

So before today, a client on one segment would make a DHCP request, but the legitimate DHCP server (at the switch) was one Ethernet segment closer to the client than the rogue DHCP server, so it would always win. As of today, the legitimate DHCP server was the same distance from the client as the rogue one, so part of the time it’d lose. That is exactly what was happening: not every DHCP lease was broken, just the occasional one.
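The race can be illustrated with a small simulation. This is a toy model under loudly stated assumptions: each server's response time is a fixed base delay plus uniform random jitter, and the millisecond figures are invented rather than measured. It only shows why a head start produces "always wins" and a tie produces "sometimes loses".

```python
import random


def rogue_win_rate(legit_base_ms, rogue_base_ms, jitter_ms,
                   trials=10_000, seed=1):
    """Fraction of trials in which the rogue server's offer arrives first.

    Assumption: response time = base delay + uniform jitter in [0, jitter_ms].
    """
    rng = random.Random(seed)  # seeded so runs are repeatable
    rogue_wins = 0
    for _ in range(trials):
        legit = legit_base_ms + rng.uniform(0, jitter_ms)
        rogue = rogue_base_ms + rng.uniform(0, jitter_ms)
        if rogue < legit:
            rogue_wins += 1
    return rogue_wins / trials


if __name__ == "__main__":
    # Old layout: the legitimate server is one hop closer (hypothetical 1 ms
    # head start, larger than the 0.5 ms jitter), so the rogue never wins.
    print("before:", rogue_win_rate(1.0, 2.0, jitter_ms=0.5))
    # New layout: both servers are the same distance away, so it is a coin toss.
    print("after: ", rogue_win_rate(2.0, 2.0, jitter_ms=0.5))
```

In practice the rogue server's own processing time matters too, so the real split need not be even; the model only captures the change from "never" to "sometimes".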

Sometimes it’s easy to forget that actual electrons need to move around for this stuff to work — which in turn reminded me of Trey Harris’s 500-mile email.


2 responses to “Fun with DHCP”

  1. I recently had to deal with a similar situation, where $BOSS had been playing with some wifi hotspot software over the weekend, and didn’t realise that he was running a DHCP server when he plugged his laptop into the network on Monday morning. We were lucky enough to have a managed switch that records which MACs are linked to each port.

    One thing I found peculiar, and which doesn’t entirely jibe with your argument that physics comes into play where competing DHCP servers are concerned, is that all of the Windows machines got useless addresses from the hotspot application, and all of our Linux machines got useful addresses from the ‘good’ DHCPD. This in spite of the fact that latency to the ‘bad’ DHCPD was noticeably and consistently higher than to the ‘good’ one.

    I suspect that the truth is somewhere in the middle. For a given dhcp client implementation, results are probably consistent, but distance doesn’t always seem to be the definitive factor.

  2. Well, only Windows hosts were affected today, but those same Windows hosts weren’t affected yesterday. That’s the response time at work. That none of the Macs were affected is probably the other side of the coin. Now that I think about it, though, only the DNS server and domain name were getting updated by the rogue DHCP server, so it was almost certainly tickling something Windows-specific.
    Still, that wasn’t the part that was hard to figure out — the tricky part was thinking to look at something that hadn’t changed recently when something obvious had changed that morning.