x'21' (33): Taxir

posted Dec 12, 2014, 8:14 AM by Jeff Ogden   [ updated May 21, 2016, 10:10 PM ]
MTS is mentioned on a web page written by Steve Tolkin (the same person who was the inspiration for the MTS Editor's message "not yet Tolkin, not yet"). The web page is: "Tolkin talking -- Software and Speculations" at http://home.comcast.net/~tolkin.family/.

 . . .

Historically first paper on bitwise storage in databases

The TAXIR system was probably the first database to use bitwise storage. The bitwise storage technique is similar to bit-map indexes, in that each item is represented using bits, rather than a sequence of contiguous bytes, as in most conventional databases. They also share the common characteristic that the data is organized by columns (attributes), rather than by rows (records). The key difference is that in bitwise storage there is only one copy of the data, whereas a bit-map index is another physically stored access structure (the "index") in addition to the storage used for the "base table".

Several modern systems including Sybase IQ and Nucleus use this technique, which can dramatically improve performance by reducing both space (memory) and time. In fact both Sybase IQ and Nucleus have been granted patents in this area. But the basic ideas date back more than thirty years. Another very early system that used bit-map indexes (although not bitwise storage) is Model 204 from CCA, which is still being used today.

"The Theory of the TAXIR Accessioner" by George F. Estabrook and Robert C. Brill was published in Mathematical Biosciences 5 (1969) pp. 327-340. At the time of the publication the system had already been operational for several years, at the University of Colorado. Development was later based at the University of Michigan Computing Center. TAXIR was deployed at the several dozen sites running the MTS operating system, and became defunct along with it in the early 1990's. Note that TAXIR predates the relational model of databases. It was a "flat-file" system that (in modern terms) could do select and project, but not join.

Because this paper was published in a journal read primarily by biologists it was largely overlooked by the computer science community. It is not included in the large DBLP Computer Science Bibliography. Nor is this paper included in Gio Wiederhold's database bibliography "from the dark ages to 1991". The paper is still of interest today in part because of the recent activity in systems using bitwise storage. The paper differs from many contemporary publications in the computer science literature in that it both gives the details of the data structures and algorithms needed to implement the system, and also proves its claims for correctness, completeness, and (nearly) optimal space utilization.

I am making the paper available in several formats, due to the varying capabilities of browsers, etc. (Thanks to Elsevier for granting permission.) The first format represents certain special characters using the HTML 4.0 standard entity names. However not all the browsers fully support this standard yet, including Netscape 4 and Microsoft Internet Explorer 5.

To decide which format to use look here: ⊕

If you see a small plus sign in a circle then you can use format 1. If instead you see the literal string + or a small box, or anything else, then you should use format 2 when in a browser. Format 3, Postscript, will look the best, provided you have a way to print and/or view a PostScript file. In particular the "overscore" character is correctly displayed as a little bar directly over the character it applies to, as in conventional mathematics, rather than before it, as in the browser. However this format needs to be printed to a Postscript printer, or viewed using a tool such as Ghostscript or Adobe Acrobat. (The PostScript file was generated using the Perl script html2ps. In all three formats the rendering of the equations in the Appendix is not perfect, because there is not yet a standard way to markup math. Maybe later this could be done using MathML.)

It is available in the following formats:
  1. HTML using entities
  2. HTML using symbol font
  3. Postscript
  4. Word for Windows 95 (but not up to date!)

Here is my writeup that describes the bitwise storage technique, as used by TAXIR. This does not describe the exact techniques used by other systems.

 . . .


Most of the links in the above text are broken. Here are some new links that don't replace the broken links, but which may be helpful never-the-less:


Taxir, the first column oriented database, in 1969 paper
a note by Steve Tolkin in DbaSpot, 19 November 2003

In another thread (on Alterian) Nigel Pendse and others discussed the
idea of a column oriented approach to storing data.

This "column oriented" approach to database is very old, dating to
before 1969. On my home page http://home.comcast.net/~tolkin.family I
have put a copy of a paper that is the earliest description of this
technique. Here is the start of that page:

Tolkin talking -- Software and Speculations

Highlights -- Bitwise storage in databases

The TAXIR system was probably the first database to use bitwise
storage. The bitwise storage technique is similar to bit-map indexes,
in that each item is represented using bits, rather than a sequence of
contiguous bytes, as in most conventional databases. They also share
the common characteristic that the data is organized by columns
(attributes), rather than by rows (records). The key difference is
that in bitwise storage there is only one copy of the data, whereas a
bit-map index is another physically stored access structure (the
"index") in addition to the storage used for the "base table".

Several modern systems including Sybase IQ and Nucleus use this
technique, which can dramatically improve performance by reducing both
space (memory) and time. In fact both Sybase IQ and Nucleus have been
granted patents in this area. But the basic ideas date back more than
thirty years. Another very early system that used bit-map indexes
(although not bitwise storage) is Model 204 from CCA, which is still
being used today.

"The Theory of the TAXIR Accessioner" by George F. Estabrook and
Robert C. Brill was published in Mathematical Biosciences 5 (1969)
pp. 327-340.

....

Then I provide links to the actual paper in several formats: HTML,
Postscript, and Word.


P.S. I posted this paper, the earliest to describe these techniques,
so it would not be lost. I believe it is is historical importance.
Unfortunately when my home pages were moved from
home.attbi.com/~tolkin to http://home.comcast.net/~tolkin.family all
the links were broken amd so it is no longer found by Google and other
search engines. So on a personal note, if people can link to this
page so Google etc. can find it again, I would appreciate it.


Hopefully helpfully yours,
Steve
--
Steven Tolkin Steve . Tolkin at Fmr . Com 617-563-0516
Fidelity Investments 82 Devonshire St. V4D Boston MA 02109
There is nothing so practical as a good theory. Comments are by me,
not Fidelity Investments, its subsidiaries or affiliates.

"Nigel Pendse" writes:

> I'm not really an expert on COP databases, and it doesn't help they don't
> use consistent terminology, but I think that Aleri, Sand, smartFOCUS, Sybase
> IQ, Synera, etc probably come into the category. But that doesn't mean that
> any are exactly the same as Alterian, as each approaches the problem in a
> different way.
>
>
> "IanS" wrote in message
> news:bpda06$1nsv$1@news.wplus.net
>> Thanks Nigel. Your reply has cleared a few things up for me.
>>
>> Can you list a few suppliers of COP databases please. Search engines
>> are struggling with the Column-Oriented
>> Technology description.
>>
>> thanks
>>
>> Ian
>>
>>
>> "Nigel Pendse" wrote in message news:1068839173.53782.0@ersa.uk.clara.net...
>>> CBAT is Alterian's name for I what prefer to call Column-Oriented
>>> Technology, or COP. It's not a methodology but a type of optimised
>>> database. Alterian's implementation is proprietary and unique, but
>>> there are a number of other COP products available that do a similar
>>> job. So, to answer you question, you can buy other databases based
>>> on similar principles, but they won't describe themselves as CBAT as
>>> that's Alterian's own name.
>>>
>>> You can think of COP as being a category of high-performance
>>> products that are used for the analysis of large volumes of
>>> transaction information, in the same way as MOLAP is used for the
>>> analysis of multidimensional data. Both categories use special,
>>> highly-optimised databases that, for the particular apps for which
>>> they are intended, are far faster and more functional than
>>> general-purpose alternatives such as relational databases. But this
>>> doesn't man that COP and MOLAP are rhe same thing.
>>>
[rest snipped]
Comments