CSCI 161: Binary search trees (overview)

One of the most common, and most useful, data types is the tree.

A tree is composed of many connected nodes or elements, just as in stacks, lists, and queues, but the interconnection arrangement is different.

In a typical tree organization, each node consists of some data and some pointers to other nodes, referred to as children of the current node.

If the tree held integer data, then a typical graphical representation might look something like:

                  3
                 /|\
                / | \       (the origin of the term "tree" 
               1  4  9       is obvious if you look at the
              /  / \  \      structure upside down)
             /  /   \  \
            6  0     8  5
               |       /
               |      /
               2     7
We often refer to the "top" node in such a representation as the root of the tree. The root is the only node which is not the child of some other node.

At the other end, any node which does not have any children is often referred to as a leaf.

The level of a particular node is its distance from the root, while the height (or depth) of the tree is the number of levels in the tree.

In the tree shown above, the root is 3; the leaves are 6, 2, 8, and 7; node 0 is at level 2; and the height of the tree is 4 (levels 0 through 3).

Each child of a particular node is also said to represent a particular subtree - i.e. a smaller tree "underneath" the current node.

Trees are typically organized so that different kinds of data are stored in different subtrees, allowing us to categorize and access our data in a more efficient fashion.

The most common operations associated with trees are insertion, deletion, searching, and traversal (visiting every node in order to process its data).

Tree applications

Trees have a wide variety of uses: organizing hierarchical data, storing sorted data for fast searching and retrieval, and representing arithmetic expressions (expression trees, which we will meet later).

Binary Tree ADT

As with most data types, the manipulation of the data type is much simpler (though less flexible) if we establish a precise structure for the data type.

In the case of trees, the most common type of tree is the binary tree.

A tree is said to be a binary tree if every node in the tree has at most two children. (I.e. each node has zero, one, or two children.)

In binary trees we refer to the two children (where they exist) as the left child and the right child.

For example, the following is a valid binary tree:

           9
          / \
         /   \
        3     7
       / \     \
      11  1     4
               /
              6
             / \
            4   1
In this tree, node 7 is the right child of node 9, node 11 is the left child of node 3, etc.
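
In a language such as C++, a binary tree node might be represented along the following lines. This is a sketch only (the struct name and fields are our own choice), but the later sketches in these notes assume this same Node type:

struct Node {
   int data;       // the value stored in this node
   Node* left;     // pointer to the left child, or nullptr if there is none
   Node* right;    // pointer to the right child, or nullptr if there is none
   Node(int d) : data(d), left(nullptr), right(nullptr) {}
};

An empty tree is then represented simply by a null root pointer.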

Binary Search Tree ADT

One of the most effective uses for binary trees is storing and retrieving data that can be sorted on some key value. (E.g. names that can be sorted alphabetically, records that can be sorted by some numeric identifier, etc.)

For this purpose, we will use a special category of binary trees called binary search trees, or BSTs.

A tree is a binary search tree if it is a binary tree and the following holds for every node in the tree:

   - every value stored in the node's left subtree is less than the value in the node, and
   - every value stored in the node's right subtree is greater than or equal to the value in the node

Examples of valid binary search trees:

      5                9   0
     / \              /     \
    /   \            8       0
   2     12         /         \
  / \      \       7           1
 1   4      14    /             \
           /     6               4
         13                     /
                               3
                               /
                              1
Examples of invalid binary search trees:
      5                9   0          0
     / \              /     \          \
    /   \            8       0          1
   2     12         /         \          \
  / \      \       7           1          0
 1   5      14    /             \
           /     7               4
         13                     /
                               5

Initially, we will consider the following operations for BSTs: insertion, searching, traversal, determining the depth and size of the tree, finding the largest and smallest values, creating the mirror image of the tree, and deletion.

Insertion operations on BSTs

We will build our BSTs incrementally - starting with an empty tree and adding one element at a time.

As we do this, we need to ensure that we maintain the BST properties, i.e. we always insert elements on the correct "side" of the current element.

The algorithm to do this is recursive, starting with the root node of the tree:

Let C be the data value of the current node, and
    D be the data value being inserted

If the tree is currently empty then simply create a node to contain
   D and make that the root of the tree, 
otherwise perform the following:

   If D is less than C then
      if the left subtree is currently empty
         create a new node containing D and make it 
         the left child of the current node
      otherwise call insert, with value D, on
         the left child of the current node
   
   Otherwise (D is greater than or equal to C)
       if the right subtree is currently empty
          create a new node containing D and make it
          the right child of the current node
       otherwise call insert, with value D, on
          the right child of the current node
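
As a concrete illustration, here is one possible C++ version of the insertion algorithm (a sketch, assuming the Node struct introduced earlier). Passing the pointer by reference lets the same function handle the empty-tree case, so the "is the subtree empty?" checks from the pseudocode are folded into the top of the function:

// Insert value d into the BST whose root pointer is 'root' (a sketch).
// The pointer is passed by reference so that an empty subtree (nullptr)
// can be replaced with a pointer to the newly created node.
void insert(Node*& root, int d) {
   if (root == nullptr)          // empty (sub)tree: the new node goes here
      root = new Node(d);
   else if (d < root->data)      // smaller values go into the left subtree
      insert(root->left, d);
   else                          // equal or larger values go to the right
      insert(root->right, d);
}

Starting from an empty tree (Node* root = nullptr;) and inserting 3, 5, 2, 4, 1, 6, and 7 with this function produces exactly the tree traced step by step below.
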
Observe: suppose we were to insert values 3, 5, 2, 4, 1, 6, and 7 in that order:
Initially we store 3 as the only tree value

         3

Then, since 5 is greater than 3 it becomes the right child of 3

         3
          \ 
           5

Next, since 2 is less than 3 it becomes the left child of 3

         3
        / \
       2   5

Next, since 4 is greater than 3 it is inserted in the right subtree of 3,
   and since 4 is then less than 5 it becomes the left child of 5

         3
        / \
       2   5
          /
         4

Next, since 1 is less than 3 it is inserted in the left subtree of 3,
   and since it is less than 2 it becomes the left child of 2

         3
        / \
       2   5
      /   /
     1   4

Next, since 6 is greater than 3 it is inserted in the right subtree of 3,
   and since it is greater than 5 it becomes the right child of 5

         3
        / \
       2   5
      /   / \
     1   4   6

Finally, since 7 is greater than 3 it is inserted in the right subtree of 3,
   and since it is greater than 5 it is inserted in the right subtree of 5,
   and since it is greater than 6 it becomes the right child of 6

         3
        / \
       2   5
      /   / \
     1   4   6
              \
               7
A note on the efficiency of insert:

Since we only examine one node at each level while we search for a spot to insert a new value, the number of calls to insert is less than or equal to the height of the tree.

If the tree is very dense (i.e. most nodes have two children) then the height of the tree is roughly log2(N), where N is the number of nodes in the tree. (Specifically, a completely full tree of height h contains 2^h - 1 nodes.)

This means that, for a full tree of a million elements, the number of recursive calls is still only about 20 ... far better than having to do a search/insert on a sorted list, where we would have to traverse (on average) a half-million elements!

Unfortunately, the insert algorithm we've discussed so far does not ensure that the tree created will be "almost full": if we were given the elements in sorted order, for instance, the tree would look like

   1
    \
     2
      \
       3
        \
         4
         etc
In this case, each new insert would involve traversing all the previously entered values, which would be about the same as the old "sorted list" insert.

Searching BSTs

If we know that we are working with a correctly-formatted binary search tree, then there is a very simple recursive algorithm to search for a specific data value:
If the data in the current node matches the search key then we are done
otherwise:
   if the data in the current node is greater than the search key
      then we must search the left subtree:
           if the left subtree is empty then there is no match
           otherwise make a recursive call on the left child and key
   otherwise we must search the right subtree:
           if the right subtree is empty then there is no match
           otherwise make a recursive call on the right child and key
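
In C++ this search might be sketched as follows (again assuming the earlier Node struct); returning a pointer to the matching node, or nullptr when there is no match, is one common convention:

// Search the BST rooted at 'root' for 'key' (a sketch).
// Returns a pointer to a node containing the key, or nullptr if no match.
Node* find(Node* root, int key) {
   if (root == nullptr)
      return nullptr;                    // empty subtree: no match
   if (key == root->data)
      return root;                       // found a matching node
   if (key < root->data)
      return find(root->left, key);      // smaller keys must be to the left
   return find(root->right, key);        // larger (or equal) keys are to the right
}
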
Efficiency: as with the insertion process, the search process can at worst involve following a single path from the root node to a leaf. Thus the time taken to search the tree is bound by the height of the tree.

Again, if the tree is (close to) full, the search time is on the order of log2(N) for a tree with N elements, but if the tree is very sparse the search time is on the order of N.

Traversing BSTs: preorder, inorder, postorder

A tree traversal is simply the process of visiting each node in the tree exactly once, for the purpose of applying some processing to the nodes.

Typical applications include printing every value stored in the tree, making a copy of the tree, and releasing all of the tree's nodes when it is no longer needed.

Tree traversals can be performed very efficiently with simple recursive procedures, and there are three primary types of traversal:

   preorder:  process the current node, then traverse its left subtree, then its right subtree
   inorder:   traverse the left subtree, then process the current node, then traverse the right subtree
   postorder: traverse the left subtree, then the right subtree, then process the current node

Note that the only difference in the three traversals is the time at which we process the data for the current element.

We will consider preorder and postorder traversals when we consider the evaluation of expression trees, but for binary search trees we are primarily interested in the inorder traversal.

If an inorder traversal is applied to a binary search tree then the data values contained in the tree are processed in sorted order.

This makes sense if we consider the tree structure and the nature of the traversal:

process the left subtree
   - i.e. process elements smaller than the current one
process the current element
process the right subtree
   - i.e. process elements larger than the current one
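
The three traversals might be sketched in C++ as follows (the function names are ours, and "processing" a node here simply prints its value); note that the code is identical except for where the print statement sits:

#include <iostream>

void preorder(Node* n) {
   if (n == nullptr) return;
   std::cout << n->data << ' ';   // process, then left subtree, then right
   preorder(n->left);
   preorder(n->right);
}

void inorder(Node* n) {
   if (n == nullptr) return;
   inorder(n->left);              // left subtree, then process, then right
   std::cout << n->data << ' ';
   inorder(n->right);
}

void postorder(Node* n) {
   if (n == nullptr) return;
   postorder(n->left);            // left subtree, then right, then process
   postorder(n->right);
   std::cout << n->data << ' ';
}

Applied to the seven-node tree built in the insertion example, inorder prints 1 2 3 4 5 6 7.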

Depth and size of BSTs

Again, it is relatively simple to create recursive algorithms which determine the height or size (number of nodes) in a binary tree. (The two algorithms below work regardless of whether or not the tree also happens to be a search tree.)

To determine the size of a tree, we simply determine the size of the two subtrees (using recursive calls), add these together, then add one (for the current node).

size = 1
if the left child is not null,
   then make a recursive call to determine the size
        of the left subtree, and add this to size
if the right child is not null,
   then make a recursive call to determine the size
        of the right subtree, and add this to size
return the size
To determine the height of the tree, we determine which of the two subtrees has the greater height and add one to this value:
current height = 0
if the left child is not null,
   then make a recursive call to determine its height,
        and set the current height to this value
if the right child is not null,
   then make a recursive call to determine its height,
        and if this value is greater than the current height
        then update the current height
add one to the current height and return this value
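
Possible C++ versions of these two routines (sketches, using the same Node struct as before). Writing each with a null check at the top makes the empty-tree case explicit and keeps the code a little shorter than the pseudocode above:

#include <algorithm>   // for std::max

// Number of nodes in the tree rooted at n.
int size(Node* n) {
   if (n == nullptr) return 0;                   // an empty tree has size 0
   return 1 + size(n->left) + size(n->right);    // this node plus both subtrees
}

// Height of the tree rooted at n, counted as the number of levels
// (0 for an empty tree, 1 for a single node, and so on).
int height(Node* n) {
   if (n == nullptr) return 0;
   return 1 + std::max(height(n->left), height(n->right));
}
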
In terms of efficiency, the running time for each of these algorithms is proportional to the size of the tree.

Finding largest and smallest values

In our delete routines and tree reorganization routines we will want to be able to find the largest or smallest values in some subtrees, and again relatively simple recursive algorithms suffice:

Consider the following trees, looking for the largest and smallest values and for a pattern that determines where they appear:

      3  3         8              4
     /    \       / \            / \
    2      4     /   \          /   \
   /        \   6     10       2     6
  1          4   \   /        / \   / \
                  7 9        1   3 5   7
By noting that larger values (if any) must always be to the right, and smaller values (if any) must always be to the left, we arrive at the following algorithms:
FindLargest:
   if there is no right child then 
      return the current element
   otherwise call recursively on the right child

FindSmallest:
   if there is no left child then
      return the current element
   otherwise call recursively on the left child
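
In C++ (same Node struct, and assuming the tree passed in is not empty) these might look like:

// Largest value in a non-empty BST: keep following right children.
int findLargest(Node* n) {
   if (n->right == nullptr)
      return n->data;             // nothing larger exists to the right
   return findLargest(n->right);
}

// Smallest value in a non-empty BST: keep following left children.
int findSmallest(Node* n) {
   if (n->left == nullptr)
      return n->data;             // nothing smaller exists to the left
   return findSmallest(n->left);
}
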
Creating the mirror image of a tree

Suppose we decided that we wanted to reverse the structure of a tree - creating its mirror image.

Again, this is possible with a relatively simple recursive function:
- swap the current node's left subtree with its right subtree
- call recursively on the left child
- call recursively on the right child
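
A C++ sketch of the mirroring routine (same Node struct); std::swap exchanges the two child pointers, and the recursive calls then mirror each subtree:

#include <utility>   // for std::swap

// Replace the tree rooted at n with its mirror image.
void mirror(Node* n) {
   if (n == nullptr) return;
   std::swap(n->left, n->right);   // swap the left and right subtrees
   mirror(n->left);                // then mirror each subtree in turn
   mirror(n->right);
}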

Deleting elements from BSTs

Suppose we have a binary search tree, and we decide we want to remove a specific value: it is easy if the value is in a leaf node (just eliminate the node), but what do we do if it is not in a leaf?

Consider the tree below: what needs to happen if we want to delete the value 6?

         4
        / \
       /   \
      /     \
     3       6
            / \
           /   \
          4     8
           \   / \
            5 7   8
If we delete 6, we can't simply move one of its children up a level; we need to do something more advanced.

Observe that the smallest value in the right subtree of 6 will still be larger than every value in the left subtree of 6, and no larger than any other value in the right subtree of 6.

Suppose, then, that we find the smallest value in the right subtree, replace 6 with that value, and then delete the old copy of that value from the right subtree.

In the above example we would wind up with another valid binary search tree:

         4
        / \
       /   \
      /     \
     3       7
            / \
           /   \
          4     8
           \     \
            5     8
Note that if the right subtree of 6 had been empty, we could instead have simply replaced 6 with its left child.

This leads us to the following simple algorithm for deletion:

   - if the node has no children then simply delete it
   - otherwise, if the node has only one child then replace the deleted node
     with its child
   - otherwise, look in the right subtree of the node to be deleted and
        find the node with the smallest value; call it N.
     Replace the value in the node-to-be-deleted with the value from N,
        and make sure you remove N from its old location in the
        right subtree
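
A possible C++ version of the deletion algorithm (a sketch, reusing the Node struct and the findSmallest function from the earlier sketches). The zero-child and one-child cases are folded together here: a node with no children is simply "replaced" by a null pointer.

// Remove one occurrence of 'key' from the BST (a sketch).  The pointer is
// passed by reference so that the deleted node's parent link can be updated.
void removeValue(Node*& root, int key) {
   if (root == nullptr)
      return;                              // key is not in the tree
   if (key < root->data)
      removeValue(root->left, key);        // key would be in the left subtree
   else if (key > root->data)
      removeValue(root->right, key);       // key would be in the right subtree
   else if (root->left == nullptr || root->right == nullptr) {
      // zero or one child: replace this node with its only child (or nullptr)
      Node* child = (root->left != nullptr) ? root->left : root->right;
      delete root;
      root = child;
   } else {
      // two children: copy up the smallest value in the right subtree,
      // then remove that value from its old spot in the right subtree
      root->data = findSmallest(root->right);
      removeValue(root->right, root->data);
   }
}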