After my brief stint with the bank, watching the financial and housing markets crumble, I returned to the university. While the bank had the bad fortune of continuing to tank after I left (I should point out that I had nothing to do with this), I had the good fortune of being offered a lead position on the university's web presence team. One benefit of the position was that I had some latitude in defining my specific role.
After meeting the other folks on the team and listening to their challenges, three specific problems emerged as priority items:
- They wanted to get a handle on the intake of new requests and improve the management of the work in general
- They were looking for enhancements to their business continuity and disaster recovery processes
- They needed to improve the stability of the website's backend services running ColdFusion (yes, in 2007, people still ran ColdFusion)
All of these were clearly important issues to tackle, and I'm pleased to say we did address all of them, but for the purpose of this discussion I'm going to focus on the third issue, as it was the one that altered the way I approached future problems.
ColdFusion provides a number of services to websites, including scripting, database functionality, server clustering and task queues. It could handle much of this functionality very well; however, as web applications grew in size and complexity, ColdFusion would not always scale properly. For us, this showed up when the services would freeze and webpages would stop displaying updates. For the most part, the pages would still render, but new content would get hung up between the submission process and the back-end update process. As a result, we would receive calls that content was not displaying properly and then we would "fix" the problem by restarting the ColdFusion services.
One attempt at proactively "solving" this problem prior to my arrival was to create scheduled tasks in the OS to restart the services automatically every hour, with the two servers in the cluster set to restart half an hour apart. This quelled the problem well enough for a while, but not long after I arrived, some additional problems started to arise from it. A residual effect of these restarts was that the task queue would collect events that may or may not release properly when the services came back up. Over time, this queue would fill up with events that would then overrun the memory pool, which in turn caused everything to hang. To resolve this issue, an administrator had to go in and manually clear the queue log - essentially deleting the hung events.
Initially, this was happening once a week or so, but as time went on, it happened more and more frequently. By the time it was happening about once a day, we knew we needed a better solution than waiting for a phone call to tell us the queue needed to be cleared out.
The initial solution we arrived at was to see if there was a way to programmatically monitor the queue to watch for the number creeping up. When everything was functioning properly, there should be anywhere from a few events to maybe 100 events if you had a bunch of people submitting changes at the same time. Everything would function just fine though until there were 1000 or more events. So we built an ASP.Net app to just render a simple graphic that displayed green, yellow, red, and purple based on the number of events. Any time that we saw it go red, we knew we needed to go in to clear the queue. So the first step was monitoring the queue on screen.
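To make the monitoring step concrete, here is a minimal sketch in Python (not the ASP.Net code we actually ran) of mapping a queue depth to a status color. The only numbers taken from our experience are "about 100 is normal" and "1000 or more is trouble"; the exact yellow cutoff, the purple cutoff, and the idea that purple means "worse than red" are assumptions made purely for illustration.

```python
# Thresholds: 100 and 1000 come from the description above.
# PURPLE_AT is a hypothetical value; the real cutoff is not recorded here.
YELLOW_AT = 100
RED_AT = 1000
PURPLE_AT = 5000  # assumption: purple means "well past red"


def queue_status(event_count: int) -> str:
    """Translate a queue depth into the color shown on the status page."""
    if event_count < 0:
        raise ValueError("event_count cannot be negative")
    if event_count >= PURPLE_AT:
        return "purple"
    if event_count >= RED_AT:
        return "red"
    if event_count > YELLOW_AT:
        return "yellow"
    return "green"
```

The point of the sketch is that the hard part is not the code. It is knowing which number to watch and where normal ends.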
After running this for a bit, and confirming that it was working correctly, we added a function that would send an email alert as soon as the queue hit red. This way we could be alerted after hours without having to manually keep an eye on things. This at least gave us some freedom from having to check the screen several times a day to see how it was doing. Since it was an ASP.Net app, we could at least check it from a cell phone easily. The second step to this process was proactively sending alerts.
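A rough Python sketch of the alerting step might look like the following. It is an illustration under assumptions, not our original code: the addresses are placeholders, the `send` callable is a hypothetical hook (in practice it could wrap something like `smtplib.SMTP(...).send_message`), and alerting only once per crossing into red is a design choice I'm adding here to avoid a flood of emails.

```python
from email.message import EmailMessage
from typing import Callable

RED_AT = 1000  # trouble started at roughly 1000 queued events


def build_alert(event_count: int) -> EmailMessage:
    """Build the alert email; both addresses are placeholders."""
    msg = EmailMessage()
    msg["Subject"] = f"Queue alert: {event_count} events pending"
    msg["From"] = "monitor@example.edu"
    msg["To"] = "webteam@example.edu"
    msg.set_content("The task queue has hit red and needs to be cleared.")
    return msg


class QueueAlerter:
    """Sends one alert each time the queue crosses into red."""

    def __init__(self, send: Callable[[EmailMessage], None]) -> None:
        self._send = send      # hypothetical delivery hook
        self._in_red = False   # remembers the previous reading

    def check(self, event_count: int) -> bool:
        """Return True if an alert was sent for this reading."""
        is_red = event_count >= RED_AT
        should_alert = is_red and not self._in_red
        self._in_red = is_red
        if should_alert:
            self._send(build_alert(event_count))
        return should_alert
```

Passing the sender in as a function keeps the check easy to try out without a mail server: `QueueAlerter(print).check(1200)` prints the message instead of sending it.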
Once we got to this point, I asked the question - is there a way to clear the queue without having to log into the console to do it manually? After some research, we discovered that we could indeed call a function from ASP.Net to clear the queue. We added this function to the app we had created and wired it to a button on screen, so that when we got an alert we could pull up the app on whatever computer we were near, including our cell phones, and click the button to clear the queue. This was fantastic on multiple levels, as it was far less work for us and it could be done easily wherever we were. It also meant that, instead of one of the administrators always having to hop on their computer to resolve the issue, we could delegate the fix to anyone. We wrote very simple instructions that amounted to "If the screen is red, click the button." The third step to this process was to simplify the process programmatically.
The final step in our process came rather naturally. We had a button we could push whenever we needed to fix the problem, and we were getting alerts whenever the problem occurred. All we had to do at this point was join the two together - whenever the app went to send an alert, why not have it also call the function to clear the queue? In theory then, by the time we got the alert and checked the app, the problem should have already gone away. Once we implemented this step, this specific problem was virtually eliminated. This last step to this process was automation.
Seeing the benefits derived from this approach to problem solving reinforced it as an approach that could be applied to many future problems (some of which I will cover in later posts). To summarize this approach to troubleshooting and problem solving:
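Joining the two pieces can be sketched in a few lines of Python. Again, this is illustrative rather than the ASP.Net code we ran: `get_count`, `send_alert`, and `clear_queue` are hypothetical hooks standing in for reading the queue depth, sending the email, and calling the clear function behind the button.

```python
from typing import Callable

RED_AT = 1000  # trouble started at roughly 1000 queued events


def monitor_once(
    get_count: Callable[[], int],
    send_alert: Callable[[str], None],
    clear_queue: Callable[[], None],
) -> bool:
    """Run one check; on red, alert and then clear. Returns True if it acted."""
    count = get_count()
    if count < RED_AT:
        return False
    # Still send the alert so a human knows the automatic fix fired.
    send_alert(f"Queue reached {count} events; clearing automatically.")
    clear_queue()
    return True
```

Keeping the alert in place even after the fix is automated matters: if the alerts start arriving more often, that is the signal that the underlying problem is getting worse.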
- Set up monitoring - figure out a way to detect the problem before it occurs by identifying leading metrics that are indicators of the coming problem
- Set up alerting - once you've determined how to monitor the leading indicators, further enhance the process (and response times) by alerting folks that actions need to be taken
- Simplify the process - break down the steps to take in such a way that all of the logic can happen behind the scenes, and document the process so others can follow it without having to be experts
- Automate the process - once you're confident that the process is working consistently and you've defined it in a way that doesn't require expert intervention, hook the alerting and resolution logic together so that it automatically resolves itself
This process has proven successful time and again in the years since. As I've worked with other teams along the way, we have built systems that applied these same principles and gained tremendous efficiency in the process.