I don’t often do ranting posts, which is good, but I have learned to never say never. Which is why we are here now.
We recently went through an upgrade of a rather core part of our infrastructure. Two sites in the UK and two sites in the US received new equipment, and I was leading the project. Our in-house knowledge of the replacement equipment is not vast, especially as the new version is a complete change from the old (I am being deliberately vague here for a number of reasons). So, with this in mind, we enlisted outside assistance.
The UK installs were done by an engineer (we’ll call him GoldStar) who was employed directly by our supplier/support company. We had several meetings prior to the installations, with the engineer asking many, many questions to get a good picture of our environment, requesting lists of how the equipment was currently being used and what we would need going forward, and sending lots of Excel sheets to be filled in.
When it came to the actual installs, GoldStar arrived on time and knew his stuff. We encountered one problem, which he worked through with me, and we rectified it together. He clearly was not fazed by the problem and the job was completed well within the two days (per site).
Considering that we had now done this in the UK, we figured that the US would go smoothly. Especially as we were dealing with engineers that came directly from the manufacturer.
We had one meeting with the people involved. They asked for our requirements, which are (I think) nothing overly challenging. I made it clear that we were not well versed in the new equipment, so we were going to “the experts” because they could get it done quicker and, more importantly, follow all the best practices that we were not familiar with, since they would have the experience that we lacked.
The first install hit a big problem: some of the LACP trunks between the new equipment and our network would not come up. Moving the interfaces (and reconfiguring) resulted in the same issue.
The engineer (we’ll call him engineer1) did some googling, and said he would call his friend who used to work for the vendor, because it turns out, engineer1 does not work directly for the vendor, he is a third-party consultant! It also transpires that engineer1 had not been working in this role for very long. In fact, this was his first IT role.
Alarm bells are ringing.
Not the experienced expert we had been hoping for. This does not speak badly for the engineer. He is certified in this stuff, but clearly, the vendor did not follow our request for an experienced engineer, and engineer1 is a bit green and clearly out of his comfort zone, without knowing where to seek help from.
I ended up working from 3pm on Saturday until 1:30am Sunday morning, at which stage I said to remove all the affected cables, and we’d have another crack at it in the morning.
First thing in the morning, I logged on, fired up the VPN, removed all the trunking information from the interfaces, removed the port channels, added the port channels back and then added the interfaces back to the port channels. Nothing actually changed in the configuration. Then I waited (until lunchtime UK time) for my colleague to arrive on site and we then went through each cable one by one. Everything came up as it should. This was completed just after 8am local time. This was the time that had been arranged between my colleague and engineer1 to meet at site and complete the configuration.
An hour later, engineer1 had still not arrived on site. Nor had he responded to attempts at contact by my colleague.
More alarm bells.
My colleague headed off while I surveyed what configuration had been done and what needed to be done.
Engineer1 made contact a bit later, but had a midday flight to make. So we would not have been getting the 2 days onsite anyway.
Quite frankly I was shocked. Less than half the configuration had been done. I decided that enough was enough and family time was required, so I would look further on Monday morning.
As part of our requirements (which I had sent in a nicely tabled Word doc the week before), we needed 23 interfaces configured. Only 12 had been done, and I had to delete and recreate three of those to match our naming conventions. Again, this was all in the Word document.
I had finished all the remaining work within two to three hours on Monday, which included the time taken to research how to do it (again, I reiterate that I am not well versed in the installation and configuration of this stuff, only the day-to-day maintenance).
I duly scheduled a conference call with the respective parties for the following day.
During the call, engineer1 said that he had followed the document I sent, at which I snorted a bit and said he blatantly had not, otherwise I would not have had to delete some of his work. The vendor made it very apparent that they would not let the same mistakes happen on the next install (due that weekend).
Deciding to give him the benefit of the doubt, I let engineer1 validate the changes I had made, and to give him credit, he did a fairly good job of validating, though he did use some expletives, which I found surprising, given that he was on a call with a customer. Now, I am no saint. I will say “shit” and “fuck”, but would never dream of doing so in front of a customer; it would reflect badly on me and my employer. But there we are.
This time the vendor sent us a spreadsheet to fill in, and I filled it in as best I could. As I did not understand all of the terminology, I referred them to my original Word doc, which had all of the remaining information they would need, ready for the other engineer (we’ll call him engineer2) they were supplying to do the next data center install. Engineer2 is from the same third party as engineer1.
Saturday rolled on and my colleague was on site at 9 am (3 pm my time). He let me know that engineer2 was on his way and that they were going to crack on.
We worked together making sure everything was cabled correctly and I helped with the particulars as they did the install.
Engineer2 encountered a problem and opened a support ticket with the vendor for it. It was resolved by rebooting (twice).
Just before 10 pm (UK time), they let me know that there was a problem with one of the virtual interfaces on the new equipment. We started to troubleshoot this. Learning from the previous install, we went through the LACP trunks and used CDP to make sure everything was cabled as it should be.
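For what it is worth, the "is everything cabled as planned?" check lends itself to a small script. The following is only a minimal sketch, assuming the CDP neighbor output has already been parsed into a dictionary by some other means (the parsing is not shown). Every device name, port name and the `find_cabling_mismatches` helper are invented for illustration and do not come from our real environment.

```python
# Hypothetical cabling plan: (switch, switch port) -> (expected neighbor, neighbor port).
# All names here are made up for the example.
EXPECTED = {
    ("switch1", "Eth1/8"): ("head1", "e1"),
    ("switch2", "Eth1/8"): ("head1", "e3"),
    ("switch1", "Eth1/9"): ("head2", "e2"),
    ("switch2", "Eth1/9"): ("head2", "e4"),
}


def find_cabling_mismatches(expected, observed):
    """Return (link, expected_neighbor, observed_neighbor) for each link that differs.

    A link that is missing from `observed` is reported with None as the
    observed neighbor.
    """
    mismatches = []
    for link, want in sorted(expected.items()):
        got = observed.get(link)
        if got != want:
            mismatches.append((link, want, got))
    return mismatches


# Example: two cables swapped on switch1 (assumed data, as if parsed from CDP).
observed = dict(EXPECTED)
observed[("switch1", "Eth1/8")] = ("head2", "e2")
observed[("switch1", "Eth1/9")] = ("head1", "e1")

for link, want, got in find_cabling_mismatches(EXPECTED, observed):
    print(f"{link}: expected {want}, saw {got}")
```

Keeping the plan as plain data means the same check can be rerun after every cable move, rather than relying on somebody eyeballing the neighbor tables at midnight.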
I troubleshot with them until about 11:30, at which time I had to call it a night (I had been working with them for 8 1/2 hours by this stage).
We synced back up together at 2 pm on Sunday (8 am local time). At 3 pm they found a difference between the two sites, which was assumed (by engineer2) to be an incorrect setting on the first data center. Well, at least we are getting somewhere, but why assume that the working data center is incorrectly set up?
Engineer2 asked for some commands to be run on the UK datacenters. I ran them and supplied the output to compare.
We finally get to the crux of the issues with this install.
Massive fucking alarm bell.
Engineer1 had filled out the form for Engineer2’s installation. Somewhere along the line, engineer1 decided that my carefully laid out Word document contained a typo. So he “corrected” it.
Bear in mind that engineer1 had already (partially) configured one site as per my document…
Now, considering engineer2 is more experienced (at least going by his LinkedIn profile), he should have validated the information supplied by engineer1 against the information/requirements supplied by the customer.
At 3:22 engineer2 is now correcting yesterday’s work to match the information I had supplied. However, engineer2 believes that “we may not have this configured for optimal redundancy”.
OK, so let’s get a picture of what he has set up, shall we?
So, in the above image, we have two virtual interfaces (vif1 and vif2). Vif1 uses interfaces 1 and 3 on “head” one, and vif2 uses interfaces 2 and 4 on head two. These then connect up to the switches. Vif1 connects to port 8 on each switch, and vif2 connects to port 9 on each switch.
The idea is that if we lose a head (one or two in the above picture) then the other head will take over and data will continue to flow. Similarly, if we lose a switch, then data will continue to flow. In fact, we could lose a head and a switch and data would continue to flow. How does this not have “optimal redundancy” (at least from a cabling perspective)?
So why did it go wrong? Well, it all depends on how the switches have been configured. Each of the interfaces ends up in a VPC on the switch: the NW8s are in VPC 30, for example, and the NW9s are in VPC 31. Now if these VPCs are configured to carry different VLANs, the above solution is broken.
The engineer had put all services on head one coming out of one vif and all of the services on head 2 coming out of a different vif.
Instead, this is what should have been done:
The cabling is all the same, but this time gets carried to VPCs that actually run the same VLANs. Vif2 on each head ends up in different VPCs, but again these VPCs can still talk to each other – which is why the other datacenters worked fine.
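The failure mode described above (a vif landing in a VPC that does not carry the VLANs it needs) is the kind of thing that can be checked before anyone touches a cable. Here is a minimal sketch, assuming the VLANs per VPC and the VLANs each vif needs have been written down as plain data. VPC numbers 30 and 31 are the ones mentioned above; the VLAN numbers, the vif-to-VPC mapping and the `find_vlan_gaps` helper are purely hypothetical.

```python
def find_vlan_gaps(vif_vlans, vif_to_vpcs, vpc_vlans):
    """Map each (vif, vpc) pair to the VLANs the vif needs but the VPC lacks.

    An empty result means every VPC carries everything its vifs require.
    """
    gaps = {}
    for vif, needed in vif_vlans.items():
        for vpc in vif_to_vpcs[vif]:
            missing = set(needed) - set(vpc_vlans.get(vpc, ()))
            if missing:
                gaps[(vif, vpc)] = missing
    return gaps


# Assumed example data: VLAN numbers are invented for illustration.
vpc_vlans = {30: {10, 20}, 31: {30}}               # VLANs allowed on each VPC
vif_vlans = {"vif1": {10, 20}, "vif2": {10, 20}}   # VLANs each vif must reach
vif_to_vpcs = {"vif1": [30], "vif2": [31]}         # where each vif is cabled

for (vif, vpc), missing in find_vlan_gaps(vif_vlans, vif_to_vpcs, vpc_vlans).items():
    print(f"{vif} lands in VPC {vpc}, which does not carry VLANs {sorted(missing)}")
```

With the invented data above, vif2 is flagged because VPC 31 does not carry VLANs 10 and 20. Once the VPCs a vif lands in carry the same VLANs, the function returns an empty dictionary.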
This raises the question: why did the engineers not consult me (who set up the network), or refer back to the document I sent, which would have directed them to the solution?
I think a lot of it has to do with arrogance. They are the consultants that we are paying thousands of dollars to set this up for us. They come in with their knowledge (be it large or small) and are either too up themselves to ask the client, or too shy to ask the client why it may not be working. Either way, hours of time were lost.
This is one engagement that I will not repeat. It would have been cheaper for me to learn what was needed and, using GoldStar’s original configuration for the other sites as a guide, to fly out and do it myself.
So many mistakes were made by the vendor and the third-party consultants in this engagement. The primary mistake is that they did not listen to the client.
I am still waiting on their design document. I am not holding out much hope that it will be any good, to be honest. This whole thing has been unprofessional from the get-go.