Red-Black Trees Part 1 - Balanced Trees
By John Lenz. December 27, 2015.
This page is part one of the article series You could have invented red-black trees!.
Our goal is to build a balanced tree. At the moment, let's just think about the structure of the tree ignoring how data might be stored on the nodes. One idea is to track how unbalanced the tree is and if it gets too unbalanced, throw away the tree and rebuild it from scratch to be balanced. Exploring this idea leads to scapegoat trees. Another idea is to constantly fix the tree to be as balanced as possible. At each node, store data about the size of each subtree and if they differ by more than one after an insert or remove, modify the tree to restore balance. Exploring this idea leads to AVL trees. Another idea is to just always keep the tree balanced. Each insert or remove operation will just keep the tree balanced so no balance restoration needs to take place. Exploring this idea will lead eventually to red-black trees.
How to keep the tree balanced? Assume we have a balanced tree, by which I mean that all leaves are at the same depth, and consider how to add a new leaf. We want to add it at the same depth as all the other leaves, which we can easily do if there is space on the branch nodes. Thus instead of a binary tree, we will allow each branch node to store more than two children (see the following figure). Of course, we will need to place some limit on the number of children a branch node can have, since otherwise we will spend too much work deciding which child to visit during a lookup.
For the moment, assume that we decide that a node can have at most four children. If we want to add a new leaf under the node labeled by \(x\), we can just directly add it since \(x\) has only three children. If instead we want to add a new leaf under the node labeled by \(y\), we cannot just add the leaf since \(y\) already has four children. Instead, we will make a new branch node (call it \(y'\)) to hold the new leaf. We don't want branch nodes with only a single child, so \(y'\) will steal a few leaves from \(y\) resulting in two partially filled branch nodes.
If we create this new branch node \(y'\), we must add \(y'\) at the same depth as \(y\) so that the tree stays balanced. Thus, we look at the parent of \(y\). Call the parent of \(y\) by \(z\). If \(z\) has available space, \(y'\) can be added a child. If \(z\) is full, we can create a clone \(z'\) of \(z\) to hold \(y'\) and some children stolen from \(z\). But now \(z'\) must be added at the same depth as \(z\), so we look at the parent of \(z\). Sound familiar? We continue in this fashion up the tree creating new nodes until we find a non-full node or we reach the root. If we reach the root (call the root \(r\)) and \(r\) is full, we create a new node \(r'\) as normal and redistribute the children among \(r\) and \(r'\). Finally, we make a new node to become the root, and make this new node have \(r\) and \(r'\) as children (thus \(r\) is no longer the root).
This procedure keeps the tree balanced. Indeed, care was taken when creating new branch nodes and redistributing children to keep each node at the same depth as it was before the operation. The only place where the height of the tree grows is when we create a new root to live above \(r\) and \(r'\), but in this case every single node in the tree has its depth increased by one, still maintaining that all leaves are at the same depth. Since we can make sure each branch node always has at least two children and the leaves are the same depth, the total height of a tree with \(n\) children is at most \(\log_2 n\).
The remove operation can also be implemented. Say we want to remove a leaf \(v\). If the parent of \(v\) has at least three children (such as \(x\) or \(y\) in the figure), we can just remove \(v\) directly. If the parent of \(v\) has only two children (such as \(w\) in the figure), removing \(v\) would drop \(w\) to have only a single child. We can then redistribute the remaining child of \(w\) child among the neighboring nodes on the same level as \(w\), possibly removing \(w\) completely. If we remove \(w\), then the parent of \(w\) might drop to only a single child. We therefore repeat up the tree. If we reach the root and make the root have only a single child, then the root node can be discarded and this single child can become the new root. I am being deliberately vague here: I encourage you to draw some trees on a piece of paper and consider how you can perform remove by moving around children, always keeping nodes at the same depth and never leaving a branch node with a single child.
There are many ways of directly turning the above ideas into a balanced tree implementation. First, we must decide how to store data on the nodes so that we can find the data later. Also, we must decide how many children a branch node is allowed. There are several variations, including B-trees, 2-3 trees, 2-3-4 trees, and many more variations on these data structures (especially B-trees have many variations).
The above ideas are great in that they maintain a balanced tree, but it isn't a binary tree. There is no major reason to prefer a binary tree, but binary trees are easier to think about and simpler to code. Another minor downside of the balanced trees described above include a large amount of memory allocation and deallocation when branch nodes are created and removed. In short, binary trees make our life simpler in many ways so we would prefer them if possible. The good news is that we can use binary nodes to simulate nodes with more than two children! For example, if we have a node with three children, we can insert an "interior" node to hold two of the children as in the following figure:
When performing this conversion, we color the "interior" node red to signal that it isn't a real node in the original tree. We can perform this conversion for a node with any number of children, creating as many "interior" nodes as we need and coloring them red. Of course, the tree is no longer balanced because of these new red nodes. To keep our tree as balanced as possible, we would like to add as few red nodes as possible. Therefore, you might think that we should limit each original node to at most three children, but if you think about it a minute you can see that if a node has four children, it can be converted by adding two red nodes at the same depth. Converting nodes with four children in this way then only adds a single depth of red nodes during the conversion.
We now have our tree:
- Use a balanced tree with a bound of 4 on the number of children each branch node is allowed.
- Instead of storing this tree directly, store the converted binary tree with each node colored red or black.
- Since we now have a binary tree, the data can be placed into the tree the same as a binary search tree. Each node stores a single element, smaller elements are placed in the left subtree, and larger elements in the right subtree. This is a little weird because we are placing data on the red nodes which don't correspond to original nodes, but we are using the original tree only to help us keep the structure of the tree balanced.
- To insert, we first use the binary search property to decide where to insert a leaf. Next, the insert procedure mimics what the insert procedure would do on the balanced tree, performing operations directly on the red-black tree. That is, the insert procedure can examine the color of the nodes, decide the actions that inserting the leaf would perform on the original balanced tree, and then mimic those actions on the actual binary tree.
- To remove, use the same idea and mimic the actions from the original tree to the binary red-black tree.
While possible to implement, this turns out to be very convoluted. The insert and remove code will be written in terms of if statements and operations on the binary red-black tree, but thinking about why the code is correct requires constant mental conversion between the two trees. It would be better if we could think about and verify that the insert and remove code is correct by reasoning directly about the red-black tree. This would also help us find small optimizations such as reusing tree nodes or shortcuts in the operations. If we are able to reason about correctness directly on the binary red-black tree, these small optimizations become much easier to discover and implement.
To allow these optimizations, we don't want to require that insert on the red-black tree works exactly like it would on the original tree. Instead, we just want to verify that after insert, the red-black tree we end up with is the image of some balanced tree. Clearly not all colored binary trees are the result of converting some balanced tree, so when verifying if insert and remove are correct, we need some way of checking that the red-black binary tree that results from the insert or remove is the image of some balanced tree. The most important feature of the balanced tree is that all leaves are at the same depth and nodes have at least two children. Translated into the binary tree, we obtain the following properties:
- Every root to leaf path has the same number of black nodes. Also, each node has zero or two children.
Next, what property on red nodes will guarantee that if the property is satisfied the binary tree is the image of some balanced tree? We just write down a feature of the red nodes created during the conversion, keeping in mind we want as few red nodes as possible to keep the tree balanced.
- For each node \(v\), it is not the case that both \(v\) and \(v\)'s parent are red.
This property states that we only add a single layer of red "interior" nodes during the conversion process. We now ask ourselves: "if we have a binary tree satisfying the above two properties, is it the image of some balanced tree?" The answer is yes (try and see yourself why this is the case). The insert and remove methods no longer must track exactly the operations performed on the balanced tree and can instead just make sure that they end up with a binary tree satisfying these properties.
If you go ahead and implement using this strategy, you will discover an annoyance when inserting a new node. For example, say that we have the following red-black tree and we want to insert 8.
In the binary search tree, 8 must be added to the left of 9. In the balanced tree, we would add a new child of 11 bumping 11 up to three children. Converting to the red-black tree, we add a new red interior node to simulate the fact that 11 has three children.
But now we have a red node without a value. To keep the binary search tree property, we must move 9 up to be stored at the red node, leaving the node that used to have 9 blank (we could also move up 8 but we have the same problem). We can't take this blank node out of the tree since we would then violate the black-height property and lose our guarantee that the tree is mostly balanced. Keeping around the empty node isn't that great since then depending on the order of inserts we might have a significant fraction of empty nodes in our tree.
The main problem is that inserting a value on a black node caused two nodes to appear but we only need one node to hold the new value. What about adding the new value on a red node?
Unfortunately, the tree we have now is not the result of converting some balanced tree, since the red node has no children and red nodes appear only as "interior" nodes with two black children. But if we add another layer of black nodes across the entire tree, we will get a binary tree that is the result of converting some balanced tree! We must add an entire layer because we need to maintain the black-height property.
You might think this is even worse than having empty nodes, since we are adding a large number of nil nodes to the tree which are not storing any values, but the key observation is that the nil nodes do not need to be kept in memory.
Our strategy now is similar to the above attempt, but the properties are stated in terms of a tree with the nil nodes but phrased in a way that takes into account that the nil nodes are not actually stored in memory (but still considered to exist). A root-to-nil path is a sequence of \(L\) or \(R\) such that starting from the root and descending left or right according to the sequence, you end at nil. That is, the last step in the sequence descends in a direction for which there does not exist a child.
Our properties are then:
- (black-height property) Each root-to-nil path has the same number of black nodes.
- (no-red-edge property) For each node \(v\), it is not the case that both \(v\) and \(v\)'s parent are red.
Note that we no longer must enforce that each node has zero or two children, because the black-height property will guarantee that if a node has one child, it must be a red leaf child. Also, there is a little ambiguity here: are the nil nodes counted as an actual node when counting the number of black nodes along root-to-nil paths? Because of the way we have defined root-to-nil paths, it doesn't matter if the nil node is counted or not because there is one nil node on each root-to-nil path. It turns out that during remove it is easier to think of the nil node as actually counting, so that is what usually is done in presentations of red-black trees. But at this point in developing the data structure, it doesn't matter which way as long as we are consistent.
Definition A red-black tree is a binary search tree where in addition each node is colored red or black so that the black-height and no-red-edge properties hold. The black-height of a red-black tree is the number of black nodes on a root-to-nil path (which is the same for all paths).
Theorem If \(v\) is the root of a red-black tree, then the height of the tree rooted at \(v\) is at most \(2\log_2 n\). In addition, it is possible to insert and remove from a red-black tree while maintaining the black-height and no-red-edge properties.
It is important to highlight that we could come up with all sorts of properties that guarantee the tree is balanced, but if we can't (efficiently) insert and remove while maintaining the properties, they are not all that useful. We now have our strategy to implement the tree: store a red-black tree and code up the insert and remove methods to maintain the black-height and no-red-edge properties.