I made a stupid, stupid error this week. The task was simple: replace a misbehaving switch with something a little newer and more robust. How hard could that be?
With our head office closing (to be replaced with something smaller and with fewer servers in it), we migrated (over a long weekend) our hardware to a new data center. The move went well, but we had a switch that would drop packets.
Because the head office closed, we took all the hardware that was left over (and anything shiny that we liked the look of – I’m looking at you, Mac mini hooked up to my TV). So, naturally, I ended up with a bunch of switches and a Palo Alto firewall – all for repurposing.
I stripped the configuration I didn’t need out of the old core switch and configured it for the new site.
I took the “new” switch to the site and (after fixing a few other things), hooked it up.
I first noticed that something was not right when I lost access to our RDP bastion. Physical servers were OK, VMs were not. The hosts were up. The storage was up, but the hosts could not see the storage.
I quickly disconnected the new switch. But things did not improve.
VSS switch bites me in the Ass
Part of the issue was that I didn’t have visibility on the 10Gb switches (a stacked pair of Cisco 4500Xs). The password was not documented.
We managed (after about four attempts) to get the password on the first switch reset, which is when I learned that the VLANs had all disappeared (well, most of them, and definitely the important ones).
Eureka! Damn you VTP!
It turns out that the switch I had connected and “trimmed” down was the VTP server. The C4500s had come from the same head office VTP domain, and removing the unneeded VLANs had bumped the configuration database revision, so as soon as the “new” switch was connected, it overwrote the VLANs on the storage switches. Stupid rookie error. For a deeper dive on VTP, click this handy link!
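The giveaway, had I thought to check before connecting a second-hand switch, is the revision number in `show vtp status`. The output below is only illustrative (the domain name and revision here are made up, not ours):

```
Switch# show vtp status
VTP Version capable             : 1 to 3
VTP Operating Mode              : Server
VTP Domain Name                 : HEADOFFICE
Configuration Revision          : 47
```

If the incoming switch has a matching domain and a higher revision, its (trimmed) VLAN database wins and gets pushed to everything else in the domain.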
But what VTP taketh away, VTP also giveth. So I took the misbehaving switch, set it up as a VTP server with the same domain and password (luckily this was documented), and added the VLANs back. I had to make several adds and removals to get the database revision up past the level on the other switches so that the VLANs were propagated back.
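Roughly, the repair looked like the sketch below (the domain name, password, VLAN ID, and VLAN name are placeholders, not our real values). Each VLAN add or removal in server mode increments the revision, which is how it can be walked upward:

```
Switch(config)# vtp mode server
Switch(config)# vtp domain HEADOFFICE      ! placeholder - must match the existing domain exactly
Switch(config)# vtp password s3cr3t        ! placeholder - the documented VTP password
Switch(config)# vlan 100
Switch(config-vlan)# name STORAGE
Switch(config-vlan)# exit
```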
Once the primary switch had rebooted and the enable password had been reset (using this guide), we still had issues. Luckily we had a backup of some of the configuration (once you convert the switch to standalone, it dumps a huge list of errors on the screen when it boots), and we saw that the interfaces had lost their VLAN configuration (as happens when a switch no longer has the VLANs to use).
I still had to remove a trunk in order to get rid of a spanning-tree issue, but by about 10:30PM, we had some access back again. The rest could wait till morning.
One step forward, two steps back
I was on-site bright and early, and we found that some of the network was still not up. The first switch looked healthy enough (once we had promoted it back to VSS master).
However, the port-channels were still down and the stack was not in a good state.
We rebooted the second switch and saw it rejoin the stack (whoop!). However, on rejoining, its interfaces were wiped too: it lost its interface configuration because the primary switch did not have it, or even know about the second switch.
Again, luckily we had the configuration and entered it back in.
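Re-entering the lost interface configuration was per-port work along these lines (the interface numbers and VLAN IDs here are hypothetical, not our actual layout):

```
Switch(config)# interface TenGigabitEthernet2/1/1
Switch(config-if)# switchport mode trunk
Switch(config-if)# switchport trunk allowed vlan 100,200
Switch(config-if)# no shutdown
```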
Also, amazingly, the second switch picked up the new password we had set. This has now been documented!
Everything is now back up and running, but I’m annoyed that stupid errors were made.
1: Never use VTP. Set every switch to VTP mode transparent. It’ll save you issues in the long run.
2: Never re-use a switch unless you have wiped it completely and deleted the vlan.dat file.
3: Remember the basics. Never assume that a rookie error will never happen to you, or be caused by you.
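Lessons 1 and 2 boil down to a few IOS commands. This is a sketch of how I’d sanitise a second-hand switch now, before it ever touches a production network:

```
Switch(config)# vtp mode transparent    ! lesson 1: stop participating in VTP updates
Switch(config)# end
Switch# delete flash:vlan.dat           ! lesson 2: remove the VLAN database
Switch# write erase                     ! wipe the startup configuration
Switch# reload
```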