1.2. Halfword overflow (continued) - something semiofficial

posted Sep 15, 2010, 5:45 PM by Jeff Ogden   [ updated Dec 25, 2011, 9:15 PM ]
I found the following on the "Anecdotes" page of Josh Simon's Web site, and the change log looks pretty real. Unlike some of the other materials about MTS at this site, these items are not from the 13 May 1996 issue of UM's IT Digest (the "goodbye to MTS" issue).

Problems with the date

In November of 1989, a minor itsy-bitsy bug was discovered in the MTS code. Nothing you'd call major, really. Seems that the United Kingdom-based MTS sites were having all sorts of file system-related problems. Luckily for us in the United States, we had 5 hours before it became midnight locally. The problem was that some of the file system code used an unsigned half-word integer (16 bits) to store the number of days since zero time (March 1, 1900). Unfortunately, the rest of the file system code used a signed half-word integer (15 bits data, 1 bit sign) — and when it became the 32,768th day after zero time, the sign bit flipped and parts of the system thought files were stamped as being created or modified 32,767 days in the future. MTS didn't like this concept, so it caused all sorts of system problems. (The change log comments are available.)
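The signed/unsigned mismatch described above can be illustrated with a minimal sketch. This is not MTS code: the function names are hypothetical, and it assumes the halfword is a big-endian 16-bit field read as two's complement when signed, which is how System/370-style hardware treats halfwords.

```python
import struct

def as_unsigned_halfword(day_count):
    # Keep only the low 16 bits, as a halfword field would.
    return day_count & 0xFFFF

def as_signed_halfword(day_count):
    # Pack as an unsigned big-endian 16-bit value, then reread the same bytes as signed.
    raw = struct.pack(">H", day_count & 0xFFFF)
    return struct.unpack(">h", raw)[0]

# Day counts on either side of the boundary: unsigned view, then signed view.
for day in (32766, 32767, 32768, 32769):
    print(day, as_unsigned_halfword(day), as_signed_halfword(day))
```

Up to day 32767 both readings agree. At day 32768 the signed reading becomes -32768, so any routine using it disagrees with the rest of the system by 65536 days. How each routine then judged a file's age would depend on how it compared the two values, which the anecdote does not spell out.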

The systems programmers hurriedly patched the file system code to use unsigned half-word integers consistently, recompiled the operating system, and provided patches to the various MTS Consortium sites. (Hewlett-Packard was using a previous version of MTS — Distribution 5.1 instead of the then-current Distribution 6.0 — at one of their sites. We provided them with a binary-only version of the patch and informed them not to trust any previous backups of the operating system.)

Of course, as the senior programmer noted on the systems programmers' mailing list, this solution will only work until the 65535th day after zero time (which maps out to some time in 2061). His comment was that if anyone was still running what would in effect be a century-old operating system, then they got what they deserved. And besides, by 2061, all of the then-current systems programmers would be retired or deceased, so they really didn't much care. (Shades of the Year 2000 problem, huh?)
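As a rough check on when an unsigned halfword runs out, here is a small sketch of the date arithmetic. It assumes the stored value is a plain day offset with 0 meaning March 1, 1900; the exact MTS convention is not given in the anecdote, and `date_for_offset` is a hypothetical helper.

```python
from datetime import date, timedelta

ZERO_TIME = date(1900, 3, 1)  # zero time as given in the anecdote

def date_for_offset(days):
    # Assumption: `days` is a plain offset, with 0 meaning March 1, 1900.
    return ZERO_TIME + timedelta(days=days)

last_day = date_for_offset(0xFFFF)  # largest unsigned halfword value, 65535
print(last_day.year)

# One day later the 16-bit field wraps back to 0.
print((0xFFFF + 1) & 0xFFFF)
```

Under that assumption the last representable day falls in 2079 rather than 2061, so the year quoted above may reflect a different convention or a slip; the source does not say.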

And the change log:

Change Log

3:15pm 16 November 1989
Problems with the file system, notably $PERMIT and $FILESTATUS, on every system that uses MTS, made emergency reloads a requirement. The UB system was reloaded at 1:53pm and the UM system at 1:25pm.
4:47pm 16 November 1989
More information from the systems staff:

The down-time on the UM and UB systems was due to a bug that was exposed at midnight of November 16, 1989, which was the 32768th day after Mar 1, 1900. The file system uses halfwords to store the number of days since Mar. 1, 1900 for lastref, lastcat, and credat.

A halfword can be used to store values from -32768 to 32767, or from 0 to 65535. Parts of the system assumed the first value range, which caused various other parts of the system to PGNT, or use an incorrect value for evaluating how old a file is. This caused the $PERMIT PGNT, and the problems with HASPLOG and CMDSTAT. $FILESTATUS and $DUPLICATE also suffered from this bug.
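To make the "incorrect value for evaluating how old a file is" concrete, here is a hypothetical sketch, not taken from MTS. It assumes March 1, 1900 is counted as day 1 (which makes November 16, 1989 day 32768, matching the log) and that a file's age is today's day number minus its stored creation day; `day_number`, `read_signed`, and `file_age_days` are invented names.

```python
from datetime import date

ZERO_TIME = date(1900, 3, 1)

def day_number(d):
    # Assumption: March 1, 1900 is counted as day 1.
    return (d - ZERO_TIME).days + 1

def read_signed(halfword):
    # Interpret a 16-bit field as -32768..32767 (two's complement).
    return halfword - 0x10000 if halfword & 0x8000 else halfword

def file_age_days(today_field, credat_field, signed_today):
    # Hypothetical age check: today's day number minus the stored creation day.
    today = read_signed(today_field) if signed_today else today_field
    return today - credat_field

today = day_number(date(1989, 11, 16))   # 32768
credat = day_number(date(1989, 11, 1))   # a file created 15 days earlier

print(file_age_days(today, credat, signed_today=False))  # 15
print(file_age_days(today, credat, signed_today=True))   # -65521
```

A routine using the signed reading sees a wildly negative age for a two-week-old file, which is the sort of value that could plausibly trip checks in commands such as $PERMIT or $FILESTATUS.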

So, my take on this is that MTS didn't crash, but the problems were serious enough to require an unscheduled shutdown and reload, which is pretty close to a crash. And it doesn't appear that UM took advantage of the advance warning from the UK to avoid the need for an unscheduled shutdown. The sites in Canada had two or three more hours of warning. I wonder if it was enough to help?