B-trees

Overview of B-trees

B-trees are balanced search trees designed to work well on magnetic disks or other direct-access secondary storage devices. B-trees are similar to red-black trees, but they are better at minimizing disk I/O operations. Many database systems use B-trees, or variants of B-trees, to store information.

B-trees differ from red-black trees in that B-tree nodes may have many children, from a handful to thousands. That is, the "branching factor" of a B-tree can be quite large, although it is usually determined by characteristics of the disk unit used. B-trees are similar to red-black trees in that every n-node B-tree has height O(lg n), although the height of a B-tree can be considerably less than that of a red-black tree because its branching factor can be much larger. Therefore, B-trees can also be used to implement many dynamic-set operations in time O(lg n).

B-trees generalize binary search trees in a natural manner. Figure 18.1 shows a simple B-tree. If an internal B-tree node x contains n[x] keys, then x has n[x] 1 children. The keys in node x are used as dividing points separating the range of keys handled by x into n[x] 1 subranges, each handled by one child of x. When searching for a key in a B-tree, we make an (n[x] 1)-way decision based on comparisons with the n[x] keys stored at node x. The structure of leaf nodes differs from that of internal nodes.

Figure 18.1: A B-tree whose keys are the consonants of English. An internal node x containing n[x] keys has n[x] 1 children. All leaves are at the same depth in the tree. The lightly shaded nodes are examined in a search for the letter R.

Data structures on secondary storage

There are many different technologies available for providing memory capacity in a computer system. The primary memory (or main memory) of a computer system normally consists of silicon memory chips. This technology is typically two orders of magnitude more expensive per bit stored than magnetic storage technology, such as tapes or disks. Most computer systems also have secondary storage based on magnetic disks; the amount of such secondary storage often exceeds the amount of primary memory by at least two orders of magnitude.

Figure 18.2(a) shows a typical disk drive. The drive consists of several platters, which rotate at a constant speed around a common spindle. The surface of each platter is covered with a magnetizable material. Each platter is read or written by a head at the end of an arm. The arms are physically attached, or "ganged" together,and they can move their heads toward or away from the spindle. When a given head is stationary, the surface that passes underneath it is called a track. The read/write heads are vertically aligned at all times, and therefore the set of tracks underneath them are accessed simultaneously. Figure 18.2(b) shows such a set of tracks, which is known as a cylinder.

Figure 18.2: (a) A typical disk drive. It is composed of several platters that rotate around a spindle. Each platter is read and written with a head at the end of an arm. The arms are ganged together so that they move their heads in unison. Here, the arms rotate around a common pivot axis. A track is the surface that passes beneath the read/write head when it is stationary. (b) A cylinder consists of a set of covertical tracks.

Although disks are cheaper and have higher capacity than main memory, they are much, much slower because they have moving parts. There are two components to the mechanical motion: platter rotation and arm movement. As of this writing, commodity disks rotate at speeds of 5400-15,000 revolutions per minute (RPM), with 7200 RPM being the most common. Although 7200 RPM may seem fast, one rotation takes 8.33 milliseconds, which is almost 5 orders of magnitude longer than the 100 nanosecond access times commonly found for silicon memory. In other words, if we have to wait a full rotation for a particular item to come under the read/write head, we could access main memory almost 100,000 times during that span! On average we have to wait for only half a rotation, but still, the difference in access times for silicon memory vs. disks is enormous. Moving the arms also takes some time. As of this writing, average access times for commodity disks are in the range of 3 to 9 milliseconds.

In order to amortize the time spent waiting for mechanical movements, disks access not just one item but several at a time. Information is divided into a number of equal-sized pages of bits that appear consecutively within cylinders, and each disk read or write is of one or more entire pages. For a typical disk, a page might be 2¹¹ to 2¹⁴ bytes in length. Once the read/write head is positioned correctly and the disk has rotated to the beginning of the desired page, reading or writing a magnetic disk is entirely electronic (aside from the rotation of the disk), and large amounts of data can be read or written quickly.

Often, it takes more time to access a page of information and read it from a disk than it takes for the computer to examine all the information read. For this reason, in this chapter we shall look separately at the two principal components of the running time:

the number of disk accesses, and
the CPU (computing) time.

The number of disk accesses is measured in terms of the number of pages of information that need to be read from or written to the disk. We note that disk access time is not constant-it depends on the distance between the current track and the desired track and also on the initial rotational state of the disk. We shall nonetheless use the number of pages read or written as a first-order approximation of the total time spent accessing the disk.

In a typical B-tree application, the amount of data handled is so large that all the data do not fit into main memory at once. The B-tree algorithms copy selected pages from disk into main memory as needed and write back onto disk the pages that have changed. B-tree algorithms are designed so that only a constant number of pages are in main memory at any time; thus, the size of main memory does not limit the size of B-trees that can be handled.

SKEDSOFT

Data structures on secondary storage