Thursday, July 30, 2020

Clearing the Queue


After my brief stint with the bank and watching the financial and housing markets crumble, I returned to the university. While the bank had the bad fortune of continuing to tank after I left (I should point out, I had nothing to do with this), I had the good fortune of being offered a lead position on the university's web presence team. One benefit of the position was that I had some latitude as to what my specific role should be.

After meeting the other folks on the team and listening to their challenges, three specific problems emerged as priority items:
    1. They wanted to get a handle on the intake of new requests and improve the management of the work in general
    2. They were looking for enhancements to their business continuity and disaster recovery processes
    3. They needed to improve the stability of the website's backend services running ColdFusion (yes, in 2007, people still ran ColdFusion)
All of these were clearly important issues to tackle, and I'm pleased to say we did address all of them, but for the purpose of this discussion I'm going to focus on the third issue, as it was the one that altered the way I approached future problems.

ColdFusion provides a number of services to websites, including scripting, database functionality, server clustering, and task queues. It handled much of this functionality very well; however, as the web applications grew in size and complexity, ColdFusion would not always scale properly. For us, this showed up as the services freezing and webpages no longer displaying updates. For the most part, the pages would still render, but new content would get hung up between the submission process and the back-end update process. As a result, we would receive calls that content was not displaying properly and then we would "fix" the problem by restarting the ColdFusion services.

One attempt at proactively "solving" this problem prior to my arrival was to create scheduled tasks in the OS to restart the services automatically every hour, with the two servers in the cluster set to restart a half hour apart. This quelled the problem well enough for a while, but not long after I arrived, some additional problems started to arise from this. A residual effect of these restarts was that the task queue would collect events that may or may not release properly when the services came back up. So over time, this queue would fill up with events that would eventually overrun the memory pool, which in turn caused everything to hang. To resolve this issue, an administrator had to go in and manually clear the queue log - essentially, to delete the hung events.

Initially, this was happening once a week or so, but as time went on, it would happen more and more frequently. By the point it was happening about once a day, we knew we needed a better solution than waiting for a phone call to know the queue needed to be cleared out.

The initial solution we arrived at was to see if there was a way to programmatically monitor the queue and watch for the number of events creeping up. When everything was functioning properly, there would be anywhere from a few events to maybe 100 if a bunch of people were submitting changes at the same time. Everything would continue to function just fine until there were 1,000 or more events. So we built an ASP.Net app that rendered a simple graphic displaying green, yellow, red, or purple based on the number of events. Any time we saw it go red, we knew we needed to go in and clear the queue. So the first step was monitoring the queue on screen.
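
The thresholds translated into very simple status logic. I no longer have the original ASP.Net code, but the idea amounted to something like this (a Perl sketch for illustration only; the yellow and purple cutoffs here are placeholders - the one number I would stand behind is red kicking in around 1,000 events):

    # Sketch only - the real app was ASP.Net; only the red threshold is from memory.
    sub queue_status {
        my ($event_count) = @_;
        return 'green'  if $event_count < 100;     # normal traffic
        return 'yellow' if $event_count < 1000;    # busier than usual, keep watching
        return 'red'    if $event_count < 5000;    # hung events piling up - clear the queue
        return 'purple';                           # severely backed up
    }

    print queue_status(1500), "\n";    # prints "red"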

After running this for a bit and confirming that it was working correctly, we added a function that would send an email alert as soon as the queue hit red. This way we could be notified after hours without having to keep an eye on things manually, which freed us from checking the screen several times a day. And since it was an ASP.Net app, we could easily check it from a cell phone. The second step to this process was proactively sending alerts.

Once we got to this point, I asked the question - is there a way to clear the queue without having to log into the console and do it manually? After some research, we discovered that we could indeed call a function from ASP.Net to clear the queue. We added this function to the app and put the logic behind a button on screen, so that when we got an alert we could pull up the app on whatever computer we were near, including our cell phones, and click the button to clear the queue. This was fantastic on multiple levels: it was far less work for us, and it could be done easily wherever we were. And instead of one of the administrators always having to hop on their computer to resolve the issue, we were able to delegate the task to anyone. We wrote very simple instructions that amounted to "If the screen is red, click the button." The third step was to simplify the process programmatically.

The final step in our process came rather naturally. We had a button we could push whenever we needed to fix the problem, and we were getting alerts whenever the problem occurred. All we had to do at this point was join the two processes together - whenever the app went to send an alert, it would also call the function to clear the queue. In theory, then, by the time we got the alert and checked the app, the problem should have already gone away. Once we implemented this step, this specific problem was fully mitigated and virtually eliminated. This last step to the process was automation.
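
Conceptually, the finished loop was just the monitoring check with the remediation and the alert bolted together. Again, the real logic lived in the ASP.Net app; this Perl sketch uses stand-in stubs for the actual queue and mail calls:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Hypothetical stand-ins for the real calls into ColdFusion and our mail relay.
    sub get_queue_depth { return 0 }                 # would query the ColdFusion event queue
    sub clear_queue     { }                          # the same function the manual button called
    sub send_alert      { print "ALERT: $_[0]\n" }   # would send the email alert

    my $count = get_queue_depth();
    if ($count >= 1000) {                            # the "red" threshold from the monitoring step
        clear_queue();                               # fix it first...
        send_alert("Queue hit $count events; cleared automatically");   # ...then tell us
    }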

Seeing the benefits derived from this approach to problem solving reinforced it as an approach that could be applied to many future problems (some of which I will cover in later posts). To summarize this approach to troubleshooting and problem solving:
  1. Set up monitoring - figure out a way to detect the problem before it occurs by identifying leading metrics that are indicators of the coming problem
  2. Set up alerting - once you've determined how to monitor the leading indicators, further enhance the process (and response times) by alerting folks that actions need to be taken
  3. Simplify the process - break down the steps to take in such a way that all of the logic can happen behind the scenes, and document the process so others can follow it without having to be experts
  4. Automate the process - once you're confident that the process is working consistently and you've defined it in a way that doesn't require expert intervention, hook the alerting and resolution logic together so that it automatically resolves itself
This process has proven successful time and again in the years since. As I've worked with other teams along the way, we have built systems that applied these same principles and gained tremendous efficiency in the process.

Wednesday, July 22, 2020

Creating My OWASIS - Part 3 (Putting the pieces together and wrap-up)


In this third and final post, I will walk through the various components that went into making OWASIS work. In case you missed them, here are the links to Part One and Part Two.

This part of the process was the actual fun part - writing and assembling the scripts into a semi-cohesive package that could be run on a repeated basis to refresh the inventory information. I figured out in Part Two that I would rely on a combination of Bash and Perl scripting to make this all work. There were still a few minor obstacles to overcome, though.

For one, I wanted all of the data output in a consistent manner, and some of the commands needed to do this would not render properly if they were just called through a remote, interactive session. So I wrote a script that could be uploaded and then run on any of the servers, which I called Remote.sh. This really formed the core of the inventory system: it could be run on any server version and would return the data formatted consistently. The challenge was how to get this script onto all of the servers.

I decided to tackle the Telnet-only systems first. Since Telnet does not support file transfers, I decided to FTP (ugh, yep, not sFTP since that wasn't available) the Remote.sh script to the server first, then call the script from the Telnet session. This worked nicely and returned the information to the terminal.
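
The actual upload ended up living in a shell script (more on that below), but for what it's worth, the same FTP step is only a handful of lines in Perl with the Net::FTP module. The host and credentials here are placeholders; in the real package they came from a config file:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Net::FTP;

    # Placeholder values - do not hardcode real credentials (see the security caveats).
    my ($host, $user, $pass) = ('server01.example.com', 'username', 'password');

    my $ftp = Net::FTP->new($host) or die "Cannot connect to $host: $@";
    $ftp->login($user, $pass)      or die 'Login failed: ' . $ftp->message;
    $ftp->ascii();                                   # Remote.sh is a plain-text script
    $ftp->put('Remote.sh')         or die 'Upload failed: ' . $ftp->message;
    $ftp->quit();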

The next step was to write a script that would automatically log into Telnet and then execute the Remote.sh script that had previously been sent to the user's home directory via FTP - I called this script AutoTelnet.pl. This script incorporated the previously mentioned Expect.pm module to handle sending over the username and password (see security disclaimer in Part Two).
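
I no longer have the original AutoTelnet.pl, but the core of it looked roughly like the following - the hostname, credentials, and prompt patterns are illustrative stand-ins:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Expect;

    # Illustrative values only - the real script pulled these from prompts/config.
    my ($host, $user, $pass) = ('server01.example.com', 'username', 'password');

    my $exp = Expect->spawn('telnet', $host) or die "Cannot spawn telnet: $!";

    # Answer the login and password prompts as they appear on screen.
    $exp->expect(30, 'login:');   $exp->send("$user\r");
    $exp->expect(30, 'assword:'); $exp->send("$pass\r");

    # Wait for a shell prompt, run the previously uploaded script, capture the output.
    $exp->expect(30, -re => '[\$#]\s*$');
    $exp->send("sh Remote.sh\r");
    $exp->expect(120, -re => '[\$#]\s*$');
    print $exp->before();         # everything printed before the prompt came back

    $exp->send("exit\r");
    $exp->soft_close();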

The last piece was to build a loader script that would call these other two. All this last script for the Telnet systems did was upload Remote.sh and then execute it by running the AutoTelnet.pl script - I named this script Ftp_Remote.sh (for obvious reasons).

For the SSH servers, I still used Remote.sh to run the commands remotely on all of the servers so that I could capture the data in a consistent manner, but since SSH supports file transfers as well, the process of moving the file and then executing it was very streamlined - and it too leveraged the "expect.pm" module for automating the login process.  I called this script AutoSSH.pl.
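
From memory, the per-host logic in AutoSSH.pl boiled down to two Expect-driven commands - one scp to move the script over, one ssh to run it. This is only a sketch; the values are placeholders, and the real script also collected the output into a log file:

    use strict;
    use warnings;
    use Expect;

    # Sketch of the per-host logic; host/user/pass are placeholders.
    sub run_remote_inventory {
        my ($host, $user, $pass) = @_;
        for my $cmd ("scp Remote.sh $user\@$host:", "ssh $user\@$host sh Remote.sh") {
            my $exp = Expect->spawn($cmd) or die "spawn failed: $!";
            $exp->expect(30, 'assword:');    # wait for the password prompt
            $exp->send("$pass\r");
            $exp->soft_close();              # read the remaining output, wait for exit
        }
    }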

These scripts collectively represented the real bones of the OWASIS system. I had to write some additional supporting scripts, though, to make this as fully automated as possible. These included nslook.sh, which I used to perform an nslookup across all valid hostname ranges (the bank named their servers sequentially, fyi), and listing.pl, which parsed the output of nslook.sh to determine which systems supported SSH and which only supported Telnet. Another script, Parse2csv.pl, scraped the output files from the Remote.sh runs into a comma-separated value file.
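
I don't have the original Remote.sh output format in front of me anymore, but Parse2csv.pl was essentially just regex-driven line matching - something along these lines, with the field labels invented purely for illustration:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Illustrative only - the real output format differed, but the shape is the same:
    # match labeled lines, collect fields per host, emit one CSV row per host.
    my %host;
    while (my $line = <>) {
        if    ($line =~ /^Hostname:\s*(\S+)/)       { $host{name} = $1 }
        elsif ($line =~ /^OS Version:\s*(.+?)\s*$/) { $host{os}   = $1 }
        elsif ($line =~ /^IP Address:\s*(\S+)/)     {
            $host{ip} = $1;                          # last field for this host
            print join(',', @host{qw(name os ip)}), "\n";
            %host = ();
        }
    }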

As I mentioned in Part Two - and looking back in hindsight - there were many security issues present with the way all of this worked. For one, while I played around with making the collection of the username and password interactive for the person running the scripts to avoid hardcoding these values into the files, I still had to use a configuration file (called ftpers.txt) to store these values for running the Ftp_Remote.sh script. If you mistyped the password in either the config file or the interactive prompts, it would lock the account. This required a call to the security team (plus a mea culpa) to get the account unlocked. And this worked fine for the most part - except for the systems that were Telnet only - because I would not be able to access FTP until a successful Telnet authentication took place. So I wrote another script, called AutoTelPasswd.pl, that was my get-out-of-jail/unlock-account script. Let that run on all of the Telnet servers and I was back in business!

For anyone who has not lost total interest in all of this at this point (anyone? Bueller? Bueller?), here are the original instructions I wrote up on how to run OWASIS:

Open-source WAS Inventory Package Instructions

Note: When doing the following, be SURE to use your correct password - failure to do so WILL lock your account on all of the machines you attempt to log into
  1. Replace "<username>" and "<password>" in ftpers.txt
  2. Run "./Ftp_Remote.sh"
    1. After it has automatically ftp'd the Remote.sh script to all of the servers in tn_ips.txt, it will prompt you for a username and password to use to telnet into all of the machines and run the Remote.sh script
  3. Run "perl AutoSSH.pl ssh_ips.txt"
    1. This can be run concurrently with ./Ftp_Remote.sh, as all of the processing is done remotely, so it will not slow down your local machine.
  4. When Ftp_Remote.sh completes, view the log file in an editor that allows you to do block select mode (textpad or ultraedit32), and block select only the first character in every line of the file, and then delete that block. (This way both log files have the same format)
  5. Run "cat SSH_connections-<datestamp>.log TN_connections-<datestamp>.log > Master_Inventory.txt"
    1. This will populate a single file with all of the output from Telnet and SSH servers
  6. Run "perl Parse2csv.pl Master_Inventory.txt > <output file.csv"
    1. I usually make an output file with a datestamp similar to the tn and ssh_connections files
  7. Open your <output file>.csv file in Excel
    1. There will be three distinct partitions/ranges to the file
    2. Add text labels above the first row in each partition as follows:
      1. Partition 1: Hostname, Brand, Model#, OS Version, Processor Cores, IP Address
      2. Partition 2: Hostname, WAS Versions
      3. Partition 3: Hostname, WAS Home, Server Name, Server Status
    3. Select all of the cells in the first partition/range, go to Data, then Filter - Advanced Filter; check unique records only, and click OK
      1. Repeat for each of the three partitions
    4. Copy and paste each partition (text labels included) into its own sheet of a new Excel Workbook
    5. Rename the three sheets in the new workbook as follows:
      1. Sheet 1: Machine Info
      2. Sheet 2: WAS Versions
      3. Sheet 3: Server Info
    6. Proceed with any formatting, sorting, etc. of your choice
  8. If you so choose, now that you have a well formatted Excel "Database" file, you can import this into Access to run queries against - each sheet is equivalent to a table in a database - hostname is the primary key.




Friday, July 17, 2020

Creating My OWASIS - Part 2 (Solving the problem)


In this second part of "Creating My OWASIS", we will get into the approach I took to solve the problem of how to create an inventory of systems for the bank where I worked. If you missed Part One, which provided background and an overview of my role with the bank, you can find it here.

The assignment, you may recall, was to create an inventory of the existing WebSphere Application Servers deployed at the bank. This included identifying all of the development, test, and production systems and their associated versions of WebSphere Application Server, Linux, and certificate information. At a high level, one approach could have been to just manually log into each individual server, run commands to find the requested information, and note it in a spreadsheet. Taking this approach, I probably could have completed the assignment in roughly a week or two. And for those two weeks, my days would have amounted to arriving at work, logging into my workstation, opening up PuTTY, and walking through the list of hundreds of systems one at a time, picking up where I left off the day prior.

I don't know about you, but I do not have the energy, attention span, or desire to waste this many hours of my life in tedium. Fortunately, all of the servers running WebSphere Application Server were running a variety of Linux flavors - so perhaps I could write a script to make this process more efficient (and interesting)?

I spent some time brainstorming what was possible and how it would ideally work. My goal was to make it fully automated (or as close to it as possible) - whereby I could feed in a list of servers and it would automatically login, run some commands, and return back the desired information. I knew I could easily accomplish some of this using Bash scripts, particularly for systems that were running ssh, but I found out early on that there were a shameful number of servers still only running <gasp> telnet </gasp> of all things. Well, I wasn't going to let this lunacy slow me down - there had to be a way around this.

I shared my ideas with a friend of mine, and they gave me the suggestion to take a look at Perl, and specifically to look at using the "expect" module. This proved to be exactly the secret sauce I was looking for.

MAJOR CAVEAT - what you are about to read absolutely pre-dates my time in a security role, and while judgement is certainly allowed (encouraged, in fact), this no longer reflects recommendations that I would give today.

There were several ways in which Perl was an attractive option for what I was trying to accomplish. The major strength comes from the sheer number of modules (what other languages call libraries) available, providing a vast array of functionality from which to draw. Another major strength of Perl is its ability to parse either fixed-format or completely unstructured data, thanks to how tightly regular expressions (RegEx) are integrated into the language. This makes it tremendously easier to take output, format it into something usable, and then import it into another application (Excel, for example). The last strength is of course the one I mentioned earlier - specifically, the expect.pm module - which can be used for automating processes.
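
As a trivial example of what I mean (the output line here is made up), pulling a version number out of a blob of command output is a one-liner:

    # Made-up output blob; the regex plucks just the version string out of it.
    my $output = 'WebSphere Platform 6.1.0.23 running on host appsrv01';
    my ($version) = $output =~ /(\d+(?:\.\d+)+)/;
    print "$version\n";    # prints 6.1.0.23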

The expect.pm module performs the unique function of building what are essentially cases that fire off depending on what is output to the screen. While my plan was to use this specifically to interact with login prompts and prompts to supply passwords (again - not secure), it could really automate anything that involves "if X is returned, then do Y". Functionally, if you are familiar with IFTTT, then you already have a fundamental grasp on how this works.
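
In code, that "if X is returned, then do Y" shape looks something like this (a minimal sketch - the spawned command, patterns, and credentials are all placeholders):

    use strict;
    use warnings;
    use Expect;

    # Spawn a command, then fire off actions as particular patterns show up on screen.
    my $exp = Expect->spawn('telnet some-host') or die "spawn failed: $!";
    $exp->expect(30,
        [ qr/login:/   => sub { my $e = shift; $e->send("username\r"); exp_continue; } ],
        [ qr/assword:/ => sub { my $e = shift; $e->send("password\r"); } ],
    );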

By combining the power of Bash, Perl, and Expect.pm, I had all the tools needed to create a package of scripts that could automate from start to finish the process of building out an Open-source WebSphere Application Server Inventory System (aka "OWASIS").

Coming soon will be Part 3 of this unnecessarily lengthy topic, where I will walk through each of the components that went into the package of scripts.

Apologies to Peter Jackson for stealing his creative process.