We had a big software upgrade at work last week. After months of preparation, we were finally ready for the move. However, things didn’t exactly go as planned.
Late in the evening of 26th July 2021, we pushed a big software upgrade from staging to production. After a little hassle, we pulled it off. Little did we know, things would go south the next day.
For 27th July, we had planned to receive only a limited volume of work so we could test out the new system under very little pressure. Instead, we received a huge amount of work that had to be completed by 100+ people and delivered the same day. And then the system stopped working.
We all panicked for some time but didn’t let it get the better of us. After a few hours of continuous fixes, we finally got the system up and running again. My takeaway? Given the choice, I’d never want to be in a situation where I felt that helpless again.
What went wrong:
We tested everything locally and only a few things on the production server, which is why the changes, when pushed to production, threw errors we weren’t prepared for.
Despite all our efforts and testing, we couldn’t replicate real-world scenarios as faithfully as we’d hoped.
What went right:
The best thing that could have happened in a troublesome situation like that did happen: we found the problem before it was too late and fixed it just in the nick of time.
The continuous effort from the team, when nothing seemed to be going our way, was tremendous. We tried and tried and eventually found a solution.
We let everyone know that we were in complete control of the situation (even though we had serious doubts about it) and that it would be fixed soon.
We focused on finding an immediate solution rather than leaving the system down for long.
We disabled all the non-essential features that could potentially make our situation more miserable than it already was.
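That kind of kill switch can be as simple as a set of feature flags checked before each non-essential code path. Here is a minimal sketch of the idea; the flag names, the `is_enabled` helper, and `process_task` are hypothetical illustrations, not our actual codebase.

```python
# Hypothetical feature flags: during an incident, everything non-essential
# is flipped off while the core path stays on.
FEATURE_FLAGS = {
    "email_notifications": False,  # disabled during the incident
    "activity_reports": False,     # disabled during the incident
    "core_task_processing": True,  # must stay on
}


def is_enabled(feature: str) -> bool:
    """Return True only if the feature is explicitly switched on."""
    return FEATURE_FLAGS.get(feature, False)


def process_task(task: dict) -> str:
    """Core work always runs; extras run only when their flag is on."""
    result = f"processed:{task['id']}"
    if is_enabled("email_notifications"):
        result += "+emailed"
    return result
```

With the flags above, `process_task({"id": 7})` returns `"processed:7"`: the core work happens and the disabled extras are silently skipped, which is exactly what you want when the system is already struggling.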
DOs and DON’Ts
The first rule I’d set for a mega migration is to avoid a mega migration in the first place: migrate in small chunks instead.
Second, I’d not panic. Panicking certainly wouldn’t help us find a solution and instead would make the situation worse.
Third, I’d believe that the solution is coming and we’re close to finding it. Hope is a good thing.
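The “small chunks” rule can be sketched in a few lines: process records in bounded batches and checkpoint after each one, so a failure halts at the current chunk instead of sinking the whole migration. The `migrate_record` callback and the record list are hypothetical placeholders, since the post doesn’t describe our actual migration code.

```python
def chunked(items: list, size: int):
    """Yield successive chunks of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def migrate_in_chunks(records: list, size: int, migrate_record) -> int:
    """Migrate records chunk by chunk; return how many completed.

    If migrate_record raises partway through, everything up to the last
    completed chunk is already done, and the count tells you where to resume.
    """
    migrated = 0
    for chunk in chunked(records, size):
        for record in chunk:
            migrate_record(record)  # would raise on failure, halting here
        migrated += len(chunk)  # checkpoint: safe to resume from this count
    return migrated
```

For example, `migrate_in_chunks(list(range(10)), 3, do_migrate)` would call `do_migrate` on ten records in batches of 3, 3, 3, and 1, checkpointing after each batch.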
At last, after a few hours of continuous trial and error, we dug ourselves out of a deep hole and got the system up and running again.
The core problem is usually a tiny bug that goes unnoticed and doesn’t even look like a bug at first.
Hugely relieved at our narrow escape, my teammates and I learned a valuable lesson that day at work, a memorable one and one for the books (or blogs).