Saturday, May 23, 2020

We're Out of Serial



One of the more frustrating and stressful events I will share in this blog (to date, anyway) involved the Sun Solaris 10 SPARC systems that were running the Oracle database server for our ERP system.

By this point in my career, I had been hired into my first full-time computer-related job as a Systems Administrator in the central IT department. I was assigned as the primary SysAdmin for the ERP implementation project, and I had approximately 4 weeks plus a training course worth of experience running Solaris 10 servers prior to this. Needless to say, I was in a bit over my head.

The event stemmed from the Sun cluster that provided high availability for the database - there were two servers in the cluster, and the database could fail over between them, with a floating IP address that would move with the active instance. Mind you, this was entirely new architecture within IT, and I had one other coworker who had some semblance of Solaris experience, though none beyond version 9. Cool.

Some details of this event are a bit foggy after 14 years, but I will share what I can.

The first indication there was a problem was the phone ringing on my desk at the same time as instant messages were blowing up on my computer. They were from developers and a DBA on the ERP project team, and it seemed they were no longer able to connect to the servers. Not good. I asked them if they were using the right IP address (yes, for whatever reason, these folks never took to using DNS names, and they had a habit of pointing to the static IP rather than the float IP). They assured me they had tried all of the IPs (which I bought, because normally they would just say they were using the correct one and it would magically start working once they had me on the phone).

I tried connecting myself - to no avail - and determined that indeed there was an issue; it was particularly telling that ALL of the static IPs and the float IP were timing out across both servers. This was very not good.

There is something you need to know about Sun servers - when you first bring them up, you can either connect to a local serial port for the initial boot and setup, or you can use DHCP to automatically assign an IP and connect remotely. We used DHCP, so it was easy enough to set up from a remote session. Unlike any other server in our data center though, these did not have a VGA connector into which you could plug a monitor and keyboard when remote access failed. Typically though with Sun servers, once you got them set up that first time, you never needed to touch the serial port again; they were pinnacles of reliability.

Until they weren't.

It was about this point that I could feel the color leaving my skin, as I had never had to touch the serial port on these before. Did we even have the right connector for them? Possibly. Did I know how to use it? NOPE.

Sun would ship one fancy metal RS-232 DB9 to RJ45 connector per server. And since our ERP ran on 10 or so Sun servers, we fortunately had plenty of the DB9 to RJ45 connectors around. Easy enough: find a standard Cat-5 patch cable with RJ45 terminations on both ends, plug one of the DB9 to RJ45 connectors onto each end, connect one end to the server and the other to your laptop, and everything's peachy.... Not so fast. Plug all this in together, and there is nothing on screen - how could this be? Maybe one of the connectors was faulty? I tried several others, no dice.

A coworker had a USB to serial connector (don't ask why, no clue), so I asked to borrow that. Plugged it in, fired up Hyperterm and I got mostly gibberish. Sure, there were characters on the screen, and I could type, but nothing was happening. It did not seem to recognize any of the commands I sent. It was progress, but only barely. This was not good. My suspicion was that the pin-out for the Sun server and the expected transmission pins in the USB to serial connector were not the same, and Hyperterm was not sophisticated enough to renegotiate that serial connection to make it work.

(Quick aside - looking back on this with significantly more experience, there were probably changes I could have made in Hyperterm to get that USB connector to work properly. Oh well, didn’t happen.)

Why wouldn't Sun just ship a db9 to an ANYTHING YOU COULD FRIGGIN' PLUG INTO A LAPTOP connector???

After a bit of research (aka, Googling frantically for any lead I could find), I found a pin-out table for the Sun DB9 connector (like this one). That's when it dawned on me - it's the same problem you have with network cables! You cannot just connect a standard patch cable from one server to another; you have to use a cross-over cable so that one side's transmit pins line up with the other side's receive pins.
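For anyone who wants the cross-over idea spelled out, here is a minimal sketch in Python. Big caveat: it uses the generic DE-9 (DTE) signal layout and the usual null-modem pairings, which are my assumptions for illustration. It is not the Sun adapter's actual pin-out, and it is not a record of the exact rewiring I did. The names `DE9_DTE_PINS`, `CROSSOVER`, and `null_modem_wiring` are made up for this example.

```python
# Generic DE-9 serial port signals by pin number (DTE side).
# Assumption: this is the common PC-style layout, NOT the Sun adapter's pin-out.
DE9_DTE_PINS = {
    1: "DCD", 2: "RXD", 3: "TXD", 4: "DTR", 5: "GND",
    6: "DSR", 7: "RTS", 8: "CTS", 9: "RI",
}

# Each signal must land on its counterpart at the far end:
# transmit meets receive, and the handshake lines swap in pairs.
CROSSOVER = {
    "TXD": "RXD", "RXD": "TXD",
    "RTS": "CTS", "CTS": "RTS",
    "DTR": "DSR", "DSR": "DTR",
    "GND": "GND",  # ground is wired straight through
}


def null_modem_wiring(pins=DE9_DTE_PINS, crossover=CROSSOVER):
    """Return {near_pin: far_pin} for a simple cross-over (null-modem) cable.

    Signals without an entry in `crossover` (DCD and RI here) are left
    unconnected in this simplified sketch.
    """
    signal_to_pin = {signal: pin for pin, signal in pins.items()}
    wiring = {}
    for pin, signal in sorted(pins.items()):
        partner = crossover.get(signal)
        if partner is not None:
            wiring[pin] = signal_to_pin[partner]
    return wiring


if __name__ == "__main__":
    for near, far in null_modem_wiring().items():
        print(f"pin {near} ({DE9_DTE_PINS[near]}) -> pin {far} ({DE9_DTE_PINS[far]})")
```

A straight-through cable would map every pin to itself, which is exactly why transmit ends up talking to transmit and nothing useful shows up on screen.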

For the next 45 minutes, I found myself cracking open the case of one of the DB9 connectors, pulling out all of the pins, and then re-seating them so that each pin lined up with its cross-over counterpart. I plugged one end of this Frankenstein cable into the server and the other end into the serial port of my laptop, fired up Hyperterm, and up appeared a scrolling list of errors (recognizable as some of the previous gibberish). This time, I could send commands and receive responses - amidst the scrolling errors. I sent a simple "init 6" command to each of the servers, and when they came back up the issue seemed to have resolved itself. I still had no clue as to what caused the cluster to lose its mind, but at least I now had a cable to use should the same fiasco happen again.
