I rolled out a new firewall/DNS server/DHCP server at FreshBooks today. Went well except for one problem: occasionally people would lose DNS resolution. Well, that’s not good.
Checking their machines showed that their DNS server addresses were being changed to an address on the wrong subnet, and their domain was being changed to “mshome.net”. That last part’s a red flag: the thing that does that is Windows’ Internet Connection Sharing (ICS), which means someone had it enabled on an interface and we basically had a rogue DHCP server.
Rogue DHCP servers are a pain to track down because without a monitoring port on the switch, all you have to go by is broadcast traffic, and then all you get is the address the DHCP server thinks it’s at — which, we know, is on the wrong subnet anyhow — and its MAC address. And we’re a small shop but we still don’t have a handy list of MAC addresses lying around. I did know that the MAC address’s vendor ID was Dell.
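For anyone chasing the same kind of problem: the vendor ID is just the first three octets of the MAC address (the OUI). Here is a minimal Python sketch of pulling it out so two addresses can be compared by vendor. The function names and the sample addresses are made up for illustration, and mapping a prefix to a vendor name still needs a lookup in the IEEE registry, which this sketch does not attempt.

```python
def oui(mac: str) -> str:
    """Return the vendor prefix (first three octets) of a MAC as 'AA:BB:CC'.

    Accepts the common notations: 'aa:bb:cc:dd:ee:ff', 'AA-BB-CC-DD-EE-FF',
    and 'aabb.ccdd.eeff'.
    """
    # Drop separators and normalize case.
    digits = "".join(ch for ch in mac if ch.isalnum()).upper()
    if len(digits) != 12 or any(ch not in "0123456789ABCDEF" for ch in digits):
        raise ValueError(f"not a MAC address: {mac!r}")
    return ":".join(digits[i:i + 2] for i in (0, 2, 4))


def same_vendor(mac_a: str, mac_b: str) -> bool:
    """True if both MAC addresses share the same vendor prefix."""
    return oui(mac_a) == oui(mac_b)


if __name__ == "__main__":
    # Placeholder addresses, not the ones from this incident.
    print(oui("02-00-00-ab-cd-ef"))
    print(same_vendor("02:00:00:ab:cd:ef", "0200.0011.2233"))
```

Matching on the prefix only narrows things down to a vendor; finding the actual machine still takes the full address.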
So the first thing I did when I found the problem was to check the MAC addresses of all of the wired and wireless interfaces of the Dell computers in the office, and none of them matched! I puzzled over this for a while, had people double-check, and eventually something clicked and Saul remembered that Sunir had enabled ICS during their road trip.
I took a second look at Saul’s laptop, and there was the MAC address, on a disabled wireless broadband interface. Turns out that if you have ICS on, the DHCP server keeps running even when the shared network interface is down. I disabled ICS and the problem went away.
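The lesson for the next hunt is to search every interface, not just the ones that are up. A small sketch of that search, assuming you have already collected an inventory by hand or by script; the `Interface` record, the host names, and the addresses below are all hypothetical:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interface:
    host: str
    name: str
    mac: str
    enabled: bool


def normalize(mac: str) -> str:
    """Strip separators and lowercase so different notations compare equal."""
    return "".join(ch for ch in mac if ch.isalnum()).lower()


def find_by_mac(inventory, wanted: str):
    """Return every interface with the wanted MAC, enabled or not."""
    target = normalize(wanted)
    # Deliberately no filter on `enabled`: the culprit may be a disabled NIC.
    return [iface for iface in inventory if normalize(iface.mac) == target]


if __name__ == "__main__":
    # Made-up inventory for illustration.
    inventory = [
        Interface("laptop-a", "Ethernet", "02:00:00:00:00:01", True),
        Interface("laptop-a", "Wi-Fi", "02:00:00:00:00:02", True),
        Interface("laptop-a", "Mobile Broadband", "02:00:00:00:00:03", False),
    ]
    for hit in find_by_mac(inventory, "02-00-00-00-00-03"):
        print(hit.host, hit.name, "enabled" if hit.enabled else "disabled")
```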
But the strange part was that Saul’s been back for a week and the problem just came up today.
I scratched my head about that for a bit and then it hit me: before today, the switch in the wiring closet was in the Linksys router that also served DHCP:
[client]----[switch + dhcp server]----[saul's PC]
After today, Saul’s network segment and the new DHCP server were both connected to a separate switch:
[client]-----------[switch]-----------[saul's PC]
                       |
                       |
                 [dhcp server]
DHCP is designed to handle multiple (cooperating) DHCP servers on a segment: when a client broadcasts a discovery request, any DHCP server can respond with an offer, and the client chooses one of the offers and announces which server’s offer it will use. The usual client implementation is to accept the first offer that arrives.
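To make the "first response wins" behaviour concrete, here is a rough Python sketch of how such a client might pick among competing offers. It is a toy model of the selection step only, not a real DHCP client; the `Offer` record, the server names, and the addresses are invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Offer:
    server_id: str     # which server made the offer
    offered_ip: str    # the lease it is offering
    arrival_ms: float  # when the offer reached the client


def choose_offer(offers: Sequence[Offer]) -> Optional[Offer]:
    """Pick the offer that arrived first, as a typical client would."""
    if not offers:
        return None
    return min(offers, key=lambda offer: offer.arrival_ms)


if __name__ == "__main__":
    # Hypothetical timings: the rogue server happens to answer first.
    offers = [
        Offer("legit-server", "192.168.1.50", arrival_ms=1.2),
        Offer("rogue-ics", "192.168.0.23", arrival_ms=0.9),
    ]
    chosen = choose_offer(offers)
    print(f"client requests {chosen.offered_ip} from {chosen.server_id}")
```

Nothing in this selection step checks whether the offer comes from the server you intended, which is why a rogue server only has to be fast, not right.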
So before today, a client on one segment would make a DHCP request, but the legitimate DHCP server (at the switch) was one Ethernet segment closer to the client than the rogue DHCP server, so it would always win. As of today, the legitimate DHCP server was the same distance from the client as the rogue one, so part of the time it’d lose, which is exactly what was happening: not every DHCP lease was broken, just the occasional one.
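The race is easy to play with in a toy simulation. The sketch below assumes each response takes a fixed delay per hop plus some random jitter; the delay and jitter values are invented, and real timings depend on the switches and on how quickly each server answers. It only illustrates the argument and does not measure anything from this network.

```python
import random


def rogue_win_share(legit_hops: int, rogue_hops: int, trials: int = 10_000,
                    hop_delay: float = 1.0, jitter: float = 0.5,
                    seed: int = 0) -> float:
    """Fraction of trials in which the rogue server's offer arrives first.

    Each offer's arrival time is modelled as hops * hop_delay plus uniform
    random jitter. All numbers are arbitrary units chosen for illustration.
    """
    rng = random.Random(seed)  # seeded so repeated runs give the same result
    rogue_wins = 0
    for _ in range(trials):
        legit = legit_hops * hop_delay + rng.uniform(0, jitter)
        rogue = rogue_hops * hop_delay + rng.uniform(0, jitter)
        if rogue < legit:
            rogue_wins += 1
    return rogue_wins / trials


if __name__ == "__main__":
    # Old layout: DHCP server inside the switch, one hop closer than the rogue.
    print("before:", rogue_win_share(legit_hops=1, rogue_hops=2))
    # New layout: both servers hang off the same switch.
    print("after: ", rogue_win_share(legit_hops=2, rogue_hops=2))
```

With these assumed numbers the extra hop is larger than the jitter, so the rogue never wins in the old layout; once the hop counts are equal, the jitter alone decides and the rogue wins a sizeable share of the time.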
Sometimes it’s easy to forget that actual electrons need to move around for this stuff to work — which in turn reminded me of Trey Harris’s 500-mile email.