Twelve Months For One Night

The title of this blog, “Twelve months for one night,” is exactly what this blog is about: the last 12 months leading up to this coming Saturday night.  For the last year, a team of 40, including myself and CDW, has been meticulously planning an upgrade for 5500 agents and another 10,000 administrative users on top of that.  This blog is meant to give insight into what a project this size entails and to acknowledge all the great work the team has contributed to get it where it is.

Objective

Upgrades happen.  Equipment gets old and runs out of support, and the bigger the upgrade, the longer it seems to get put off.  My current employer had a pretty big environment when I started 2.5 years ago.  Below is a basic overview:

  • 2 UCCE clusters, one at v7.2 and one at v7.5, each with about 15 sites, though the v7.2 sites are smaller.
  • 8 CVP servers: 4 H.323 servers for the v7.2 UCCE cluster and 4 SIP servers for the v7.5 cluster.
  • 5 CUCM clusters, ~30 servers (v7.1.5).
  • ~100 IOS Gateways
  • Geographically redundant on everything.
  • Lots of custom-built CVP apps and a custom softphone for use with CTIOS.
  • CUEs, Unity XN, Remote Silent Monitoring, Exony, VPI Call Recording, wallboards.
  • An unending list of attached products that I can’t even remember.

Now, all of this is on old MCS servers, and some of it, such as the UCCE v7.2, is already out of support.  The goal was to upgrade EVERYTHING to the newest versions of their platforms (v9.x) and move it all to shiny new virtualized UCS B Series boxes.  Sounds pretty straightforward to me (hint: it wasn’t).  Let’s take a look at some specifics.

  • One of our two main data centers is in a hurricane zone, so we decided to move it as far north as we could, right up to where you have to walk around with tennis rackets on your feet.  (This meant IPs would have to change in at least one location.)
  • We moved to a new corporate Active Directory instance instead of one we were hosting ourselves.
  • When you review the compatibility matrix between v9.x and v7.x, it is a LONG upgrade to do piece by piece in a way where everything inter-operates with support.  You can find the matrix here: http://docwiki.cisco.com/wiki/Unified_CCE_Software_Compatibility_Matrix_for_9.0(x).  If anyone cares, I will detail the plan to migrate this way.  I have done it before, and it’s miserable.  It does let you avoid ‘flash’ cuts to new equipment like the one we are doing this coming weekend.  When we drew it out, we were looking at 10 different weekends OUTSIDE of all the other tasks that had to be completed.
  • We are switching from physical to virtual machines.
  • We have international sites and operate 24 hours a day, 7 days a week.  There is never a good time to do anything.

Strategy

There are a few things that really need to happen in order to make a project of this size come together.  The first and most important by far is good, reliable people on the team.  When people are dedicated to succeeding, you rarely miss your goals.  I look at project teams just like sports teams: everyone needs to pull their weight to get a win.  When they do, it’s a beautiful thing.  Internally, our staff has been growing for the 2.5 years I have been here, and CDW provides one of the best Contact Center practices in the US, so we nailed that one.

Project Management. Project Management.  Project Management.  Getting it yet?

Based on that last note, the obvious next thing you need is a great plan.  The first thing you notice about our old environment is that it had 2 UCCE and 2 CVP clusters.  The obvious first move in the project was to collapse these into the 7.5 UCCE instance.  The 7.2 sites were on separate CUCM and CVP clusters from the ones on the 7.5 UCCE instance.  I would argue this was the hardest part of the project.  Most of the sites were international call centers.  A few notes about this process:

  • This migration was brutal.  It took us 6 weekends to cut 15 sites over between clusters, with about 5 people working hard every week to fit this in.
  • You can’t use backups because the 7.5 is already live.  You must move every phone (through BAT exports), build every dial-plan from scratch, move every UCCE and CVP custom app, and UAT.  This was easily the most challenging project I have undertaken in my career.  I wrote a ton of tools to help with the User, Phone and Agent migrations, which I will release another time.
  • We migrated the sites from 5 to 7 digits and from H.323 to SIP on the same cutover night.  We also migrated them from old dial-plans to awesome new E.164 plans.
  • I can’t stress this enough: the next time I am asked to collapse two UCCE clusters into a single one, it will have to be for a client I really like.  It was tough, and with all the other changes on top of it, we may have bitten off more than we could chew.  Our team was incredible, though, and we knocked it out between May and November of 2013.
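
To give a feel for the renumbering part of that work, here is a minimal sketch in Python.  It is not one of the tools mentioned above.  It assumes a simplified phone export with two hypothetical columns, "device" and "extension", a made-up 2-digit site code that turns a 5-digit extension into a 7-digit one, and a made-up country and area code for the E.164 form.  A real BAT export has many more columns, and a real numbering plan will differ per site.

```python
import csv
import io


def to_seven_digit(extension: str, site_code: str) -> str:
    """Prefix a 5-digit extension with a 2-digit site code (assumed scheme)."""
    if len(extension) != 5 or not extension.isdigit():
        raise ValueError(f"expected a 5-digit extension, got {extension!r}")
    if len(site_code) != 2 or not site_code.isdigit():
        raise ValueError(f"expected a 2-digit site code, got {site_code!r}")
    return site_code + extension


def to_e164(seven_digit: str, country_code: str, area_code: str) -> str:
    """Build an E.164 number, assuming the 7-digit number is the local part."""
    if len(seven_digit) != 7 or not seven_digit.isdigit():
        raise ValueError(f"expected a 7-digit number, got {seven_digit!r}")
    return f"+{country_code}{area_code}{seven_digit}"


def convert_export(csv_text: str, site_code: str,
                   country_code: str, area_code: str) -> list:
    """Read the simplified export and return one dict per phone."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        seven = to_seven_digit(row["extension"].strip(), site_code)
        rows.append({
            "device": row["device"].strip(),
            "old": row["extension"].strip(),
            "new": seven,
            "e164": to_e164(seven, country_code, area_code),
        })
    return rows


if __name__ == "__main__":
    # Hypothetical sample data, not taken from the real environment.
    sample = "device,extension\nSEP001122334455,41234\n"
    for phone in convert_export(sample, "55", "1", "612"):
        print(phone)
```

The validation is deliberately strict: on a cutover night, a script that stops on one malformed extension is easier to live with than a few hundred phones that silently got the wrong number.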

By November we were far into planning, and our new infrastructure had been built out by CDW on our UCS B Series boxes.  It was integrated into our new AD.  Huge key here: when you move a UCCE to a new domain, make sure to keep the instance name the same, or backups won’t restore.  The business was doing UAT, the birds were chirping, and everyone sang Kumbaya by the fire.  Some key decisions during this phase:

  • We have a team of nearly 10 just in the call routing group.  That’s a ton of changes, and a backup restored for UAT would look totally different from production 2 months later.  We decided we had to take a backup of UCCE and CUCM 3 times between October and March so UAT could happen with new enough scripts.  This meant a ton of work: restoring the system and upgrading it from 7.5 to 9.0 three times before the night of the actual cut.
  • Move out other conversions.  Our environment may be massive, but we have a ton of sites still to convert over to Cisco.  Tons of Avaya left.
  • Upgrading 100 gateways takes a long time and lots of dedication to scheduling RFCs, and it needed lots of RAM upgrades installed all over the world.  We did this over almost the whole year.
  • It’s OK to tell people “that won’t work.”  When you start projects this big, everyone comes in ready to kick some ass.  They have huge ideas, shortcuts, and plans that would make David Blaine salivate.  After the first 40-person planning call, you realize that X won’t work.  X in that last sentence represents about 90% of the ideas in the room, including mine.  This is the phase where you start weeding through ideas to formulate real plans for a cutover.

 

Crunch Time

Eventually every project’s campfire goes dark.  Many of the people on this project are at the top of their game, so for something of this size, we had done pretty well through November (and through the end of their project).  The end months are when the good stuff usually comes out: real issues that need real solutions.  Also, many of the third-party or tougher apps hadn’t come online for UAT yet.  A couple of key events that occurred in this time:

  • Remote Silent Monitoring is the most annoying product ever.  The guide should say something such as “Hop three times while juggling flaming spoons in order to troubleshoot ‘GENERAL ERROR’”.  Seriously, every time it fails, regardless of which of the 100 possible reasons caused it, the log says GENERAL ERROR.  Thanks a bunch.  One of my co-workers did a rain dance while counting backwards from 14, and that got us over the hump.  Kudos to you, my friend.
  • We converted from the SCCP outbound dialer to the SIP dialer.  Although I know it’s an improvement, they took a product that already takes pure luck to get set up correctly the first time and said, “Alright, now you have to be a rocket scientist to install it”.  SIP gateways, proxies, and translation routes to CVP from the dialer (step backwards much?) make it even more miserable to get working the first time than it usually is.  In good form, though, its robust unreliability still seems to be its key selling point.
  • UCCE and Siebel connector.  When we scoped out all of the components in this call center upgrade, I knew the dialer and Siebel were going to be the hardest.  Good news for us: they were both in use at the same call center!
  • Siebel had to update plenty of code to get everything working properly with the SIP Dialer.  It seems to me the events in CTIOS server are unreliable with the new SIP Dialer, and thankfully our Siebel team are champions who worked around it.  Still waiting for an answer from TAC on the CTIOS events.
  • Apparently everyone but me uses our HDS for reporting projects based on raw SQL queries into other data warehouses.  These were hard to track down at times, but once we found the right person, it was very easy.
  • CTIOS Client and custom softphone updates.  This is just never a fun process, especially when you are talking about thousands of desktops and dozens of locations around the globe.  Our packaging team did an awesome job, but because it’s a Java softphone and Java’s install is just so reliable, you know it must have gone perfectly!
  • At 9.x, Unity XN is one of the few things easily compatible with both CUCM 7.1 and CUCM 9.1.  4000 voicemail users have lots of mail, and it took forever to back up and restore.  We decided to upgrade our 8.6 Unity XN on old MCS hardware to 9.1 and then restore a backup to UCS, on 2 separate weekends BEFORE the big migration.  This would save a big chunk of time on the night of our ‘flash cut’.
  • As I mentioned, we were moving data centers, and in our old datacenter we had 2 DS3s.  These also had to be moved to Woodbury, as well as our Ingress gateways, VXML Gateways and SIP Proxy.  This is another set of tasks we decided to chop off into a separate weekend ahead of time.  This was a great move.  You can read more about it in the blog post “Laying down the Hammer”: http://blog.cloverhound.com/2014/02/16/laying-down-the-hammer/
  • Our upgrade was originally scheduled for the beginning of March, but it became clear we weren’t going to make it, and the decision was made to push it back to April.  This was another great move.  Sometimes the best strategy is to pick an aggressive date to keep everyone motivated and on task, but it’s always OK to buy yourself a reasonable amount of time if it will result in higher quality work in the end.  At least I personally believe this.

Conclusion

The moral of the story is basically that there is no shortcut.  The best you can do is find chunks of the process you can break off into their own tasks to make everything more approachable.  I treat big projects like a chess game, looking as many moves ahead as I can.

Projects like these are the ones that build great teams and lasting relationships between those members, so look at it as an opportunity rather than a job.  I have had a great time going through the process even though it took a year.  Cut weekend starts tomorrow.  I will post a blog about cut night through go-live next week.