B-Tree
Last updated
Last updated
B-tree is a special form of balanced binary search tree (BBST) designed for fast disk access or I/O operations, and it's well employed in various commercial or open source databases like MySQL, Oracle, MongoDB etc. Unlike many binary search trees, B-tree has a branching factor spanning from dozens to thousands which means a tree node can have more than two direct descendants. Before diving into the details of B-tree operations, it is important to first understand differences between disk-based data structures and memory-based ones.
In the present, there are two major types of secondary storage: Hard Disk Drive (HDD) and Solid State Drive (SSD), both of which have random access speed considerably slower than RAM. (see: here). Despite of the fact that SSD is much faster than HDD in terms of access speed, costs of SSD per GB is much more than that of the HDD.
Then, the computer I/O model can be approximated as the following picture, where the secondary storage access has time is much higher than main memory and CPU:
Figure 1. A Simplified Computer I/O Model
And since the majority of data items are not stored in the memory, it is critical to have fewer I/O access to the disk drives to save as much time as possible. The data structures in disk have to be made differently than in the main memory.
A typical HDD will look like the following picture:
Figure 2. Hard Disk Drive Internal Look
It consists of one or many platters which all spin around a common spindle. The drive moves a magnetic arm to read/write data from the underlying platters using its head. When the platters are spinning and the head remains still, the traveling trajectory of the head makes a circle called track, which is further divided into multiple sectors with equally distributed physical spaces for data read/write.
Normally, there are three steps to access disk sectors: seek, latency and transfer.
seek: the time it takes for the head to move to a relevant track.
latency: the round trip time for the platters to spin for a cycle under the same read/write head.
transfer: read/write operations on disk sectors.
The time costs ranking from high to low: seek > latency > transfer. And every I/O access requires physical address lookup on the disk and triggers operations of seek while platters are spinning and so forth. Therefore, it is essential to reduce the total number of times the disk I/O access is triggered, especially seek operations in between.
In addition to above, modern computers always pre-read data in advance rather than read by need. Even though sometimes we only read 1 byte, the system still retrieve some length of data into the memory. (see locality of reference) The pre-read data item normally spans as a multiple of page size (4k in many operating systems, as data exchange unit between disk and memory) along a relevant sector (which does not involve seek)
For a B-tree data structure, each node possesses a space of a multiple of page size (computers allocate _page_s side by side physically) to have a one time I/O guarantee with maximal data needed. Since the B-tree has a very large branching factor, it has far smaller height than normal BBSTs such as red-black tree, which means a key search in B-tree would require traversing far fewer nodes. As a indexing data structure in disk, B-tree is highly efficient than normal BBSTs.
A B-tree T with branching factor M is a rooted tree that has the following properties:
Each node x has (up to) M - 1 keys and stored in non-descendent order that x.key1 ⩽ x.key2 ⩽ ... ⩽ x.keym-1.
In each node, there are M number of pointers to their children in between wherein each key will have a left pointer and a right pointer.
Keys in a node split the ranges of their children's keys: e.g. there is a pointer between two adjacent keys 3 and 7, then the child node will have keys ranging from 3 to 7 exclusively.
Each leaf node has the same depth, which is the height of the tree.
Define a constant minimum degree d (as the order of B-tree) to represent the minimum number of keys in a non-leaf node. (B-tree typically has d at least M/2)
A graphical example with M = 4 (ignoring data items for convenience, in real implementations there are items associated with keys):
Imagine there are frequent query requests sent to the database servers or file systems for specific data, some parts of indexes should stay in memory or cache to have better response time. B+ tree is more suitable than B-tree in this scenario.
In a B+ tree, all internal nodes have no data and only keys and ONLY leaves maintain all data items instead. In this way, one I/O access of a single page would retrieve a node with more indexes, therefore B+ tree is more disk-friendly (loading nodes into the memory without considering of varying data object sizes). Moreover, B+ tree is a threaded tree which means all its leaf nodes are connected and better for range query (e.g. search for a date range within the data)
MongoDB Indexes use B tree: https://docs.mongodb.com/manual/indexes/#id2
B-Tree vs Hash Table indexing in MySQL: https://stackoverflow.com/questions/7306316/b-tree-vs-hash-table
Figure 3. A B-tree Example