CSCI 161: Binary search trees (overview)

One of the most common, and most useful, data types is the tree.

A tree is composed of many connected nodes or elements, just as in stacks, lists, and queues, but the interconnection arrangement is different.

In a typical tree organization, each node consists of some data and some pointers to other nodes, referred to as children of the current node.

If the tree held integer data, then a typical graphical representation might look something like:

                  3
                 /|\
                / | \       (the origin of the term "tree" 
               1  4  9       is obvious if you look at the
              /  / \  \      structure upside down)
             /  /   \  \
            6  0     8  5
               |       /
               |      /
               2     7
We often refer to the "top" node in such a representation as the root of the tree. The root is the only node which is not the child of some other node.

At the other end, any node which does not have any children is often referred to as a leaf.

The level of a particular node is its distance from the root, while the height (or depth) of the tree is the number of levels in the tree.

In the tree shown above, the root is 3; the leaves are 6, 2, 8, and 7; node 0 is at level 2; and the height of the tree is 4 (levels 0 through 3).

Each child of a particular node is also said to represent a particular subtree - i.e. a smaller tree "underneath" the current node.

Trees are typically organized so that different kinds of data are stored in different subtrees, allowing us to categorize and access our data in a more efficient fashion.

The most common operations associated with trees are insertion, deletion, searching, and traversal (visiting every node in order to process its data).

Tree applications

Trees have a wide variety of uses: organizing hierarchical data, storing sorted data for fast searching and retrieval, and representing arithmetic expressions (expression trees, which we will meet later).

Binary Tree ADT

As with most data types, the manipulation of the data type is much simpler (though less flexible) if we establish a precise structure for the data type.

In the case of trees, the most common type of tree is the binary tree.

A tree is said to be a binary tree if every node in the tree has at most two children. (I.e. each node has zero, one, or two children.)

In binary trees we refer to the two children (where they exist) as the left child and the right child.

For example, the following is a valid binary tree:

           9
          / \
         /   \
        3     7
       / \     \
      11  1     4
               /
              6
             / \
            4   1
In this tree, node 7 is the right child of node 9, node 11 is the left child of node 3, etc.
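
In a language such as C++, a binary tree node might be represented along the following lines. This is a sketch only (the struct name and fields are our own choice), but the later sketches in these notes assume this same Node type:

struct Node {
   int data;       // the value stored in this node
   Node* left;     // pointer to the left child, or nullptr if there is none
   Node* right;    // pointer to the right child, or nullptr if there is none
   Node(int d) : data(d), left(nullptr), right(nullptr) {}
};

An empty tree is then represented simply by a null root pointer.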

Binary Search Tree ADT

One of the most effective uses for binary trees is storing and retrieving data that can be sorted on some key value. (E.g. names that can be sorted alphabetically, records that can be sorted by some numeric identifier, etc.)

For this purpose, we will use a special category of binary trees called binary search trees, or BSTs.

A tree is a binary search tree if it is a binary tree and the following holds for every node in the tree:

   - every value stored in the node's left subtree is less than the value in the node, and
   - every value stored in the node's right subtree is greater than or equal to the value in the node

Examples of valid binary search trees:

      5                9   0
     / \              /     \
    /   \            8       0
   2     12         /         \
  / \      \       7           1
 1   4      14    /             \
           /     6               4
         13                     /
                               3
                               /
                              1
Examples of invalid binary search trees:
      5                9   0          0
     / \              /     \          \
    /   \            8       0          1
   2     12         /         \          \
  / \      \       7           1          0
 1   5      14    /             \
           /     7               4
         13                     /
                               5

Initially, we will consider the following operations for BSTs: insertion, searching, traversal, determining the depth and size of the tree, finding the largest and smallest values, creating the mirror image of the tree, and deletion.

Insertion operations on BSTs

We will build our BSTs incrementally - starting with an empty tree and adding one element at a time.

As we do this, we need to ensure that we maintain the BST properties, i.e. we always insert elements on the correct "side" of the current element.

The algorithm to do this is recursive, starting with the root node of the tree:

Let C be the data value of the current node, and
    D be the data value being inserted

If the tree is currently empty then simply create a node to contain
   D and make that the root of the tree, 
otherwise perform the following:

   If D is less than C then
      if the left subtree is currently empty
         create a new node containing D and make it 
         the left child of the current node
      otherwise call insert, with value D, on
         the left child of the current node
   
   Otherwise (D is greater than or equal to C)
       if the right subtree is currently empty
          create a new node containing D and make it
          the right child of the current node
       otherwise call insert, with value D, on
          the right child of the current node
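
As a concrete illustration, here is one possible C++ version of the insertion algorithm (a sketch, assuming the Node struct introduced earlier). Passing the pointer by reference lets the same function handle the empty-tree case, so the "is the subtree empty?" checks from the pseudocode are folded into the top of the function:

// Insert value d into the BST whose root pointer is 'root' (a sketch).
// The pointer is passed by reference so that an empty subtree (nullptr)
// can be replaced with a pointer to the newly created node.
void insert(Node*& root, int d) {
   if (root == nullptr)          // empty (sub)tree: the new node goes here
      root = new Node(d);
   else if (d < root->data)      // smaller values go into the left subtree
      insert(root->left, d);
   else                          // equal or larger values go to the right
      insert(root->right, d);
}

Starting from an empty tree (Node* root = nullptr;) and inserting 3, 5, 2, 4, 1, 6, and 7 with this function produces exactly the tree traced step by step below.
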
Observe: suppose we were to insert values 3, 5, 2, 4, 1, 6, and 7 in that order:
Initially we store 3 as the only tree value

         3

Then, since 5 is greater than 3 it becomes the right child of 3

         3
          \ 
           5

Next, since 2 is less than 3 it becomes the left child of 3

         3
        / \
       2   5

Next, since 4 is greater than 3 it is inserted in the right subtree of 3,
   and since 4 is then less than 5 it becomes the left child of 5

         3
        / \
       2   5
          /
         4

Next, since 1 is less than 3 it is inserted in the left subtree of 3,
   and since it is less than 2 it becomes the left child of 2

         3
        / \
       2   5
      /   /
     1   4

Next, since 6 is greater than 3 it is inserted in the right subtree of 3,
   and since it is greater than 5 it becomes the right child of 5

         3
        / \
       2   5
      /   / \
     1   4   6

Finally, since 7 is greater than 3 it is inserted in the right subtree of 3,
   and since it is greater than 5 it is inserted in the right subtree of 5,
   and since it is greater than 6 it becomes the right child of 6

         3
        / \
       2   5
      /   / \
     1   4   6
              \
               7
A note on the efficiency of insert:

Since we only examine one node at each level while we search for a spot to insert a new value, the number of calls to insert is less than or equal to the height of the tree.

If the tree is very dense (i.e. most nodes have two children) then the height of the tree is roughly log2(N), where N is the number of nodes in the tree. (Specifically, a completely full tree of height h contains 2^h - 1 nodes.)

This means that, for a full tree of a million elements, the number of recursive calls is still only about 20 ... far better than having to do a search/insert on a sorted list, where we would have to traverse (on average) a half-million elements!

Unfortunately, the insert algorithm we've discussed so far does not ensure that the tree created will be "almost full": if we were given the elements in sorted order, for instance, the tree would look like

   1
    \
     2
      \
       3
        \
         4
         etc
In this case, each new insert would involve traversing all the previously entered values, which would be about the same as the old "sorted list" insert.

Searching BSTs

If we know that we are working with a correctly-formatted binary search tree, then there is a very simple recursive algorithm to search for a specific data value:
If the data in the current node matches the search key then we are done
otherwise:
   if the data in the current node is greater than the search key
      then we must search the left subtree:
           if the left subtree is empty then there is no match
           otherwise make a recursive call on the left child and key
   otherwise we must search the right subtree:
           if the right subtree is empty then there is no match
           otherwise make a recursive call on the right child and key
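
In C++ this search might be sketched as follows (again assuming the earlier Node struct); returning a pointer to the matching node, or nullptr when there is no match, is one common convention:

// Search the BST rooted at 'root' for 'key' (a sketch).
// Returns a pointer to a node containing the key, or nullptr if no match.
Node* find(Node* root, int key) {
   if (root == nullptr)
      return nullptr;                    // empty subtree: no match
   if (key == root->data)
      return root;                       // found a matching node
   if (key < root->data)
      return find(root->left, key);      // smaller keys must be to the left
   return find(root->right, key);        // larger (or equal) keys are to the right
}
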
Efficiency: as with the insertion process, the search process can at worst involve following a single path from the root node to a leaf. Thus the time taken to search the tree is bound by the height of the tree.

Again, if the tree is (close to) full, the search time is on the order of log2(N) for a tree with N elements, but if the tree is very sparse the search time is on the order of N.

Traversing BSTs: preorder, inorder, postorder

A tree traversal is simply the process of visiting each node in the tree exactly once, for the purpose of applying some processing to the nodes.

Typical applications include printing every value stored in the tree, making a copy of the tree, and releasing all of the tree's nodes when it is no longer needed.

Tree traversals can be performed very efficiently with simple recursive procedures, and there are three primary types of traversal:

   preorder:  process the current node, then traverse its left subtree, then its right subtree
   inorder:   traverse the left subtree, then process the current node, then traverse the right subtree
   postorder: traverse the left subtree, then the right subtree, then process the current node

Note that the only difference in the three traversals is the time at which we process the data for the current element.

We will consider preorder and postorder traversals when we consider the evaluation of expression trees, but for binary search trees we are primarily interested in the inorder traversal.

If an inorder traversal is applied to a binary search tree then the data values contained in the tree are processed in sorted order.

This makes sense if we consider the tree structure and the nature of the traversal:

process the left subtree
   - i.e. process elements smaller than the current one
process the current element
process the right subtree
   - i.e. process elements larger than the current one
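
The three traversals might be sketched in C++ as follows (the function names are ours, and "processing" a node here simply prints its value); note that the code is identical except for where the print statement sits:

#include <iostream>

void preorder(Node* n) {
   if (n == nullptr) return;
   std::cout << n->data << ' ';   // process, then left subtree, then right
   preorder(n->left);
   preorder(n->right);
}

void inorder(Node* n) {
   if (n == nullptr) return;
   inorder(n->left);              // left subtree, then process, then right
   std::cout << n->data << ' ';
   inorder(n->right);
}

void postorder(Node* n) {
   if (n == nullptr) return;
   postorder(n->left);            // left subtree, then right, then process
   postorder(n->right);
   std::cout << n->data << ' ';
}

Applied to the seven-node tree built in the insertion example, inorder prints 1 2 3 4 5 6 7.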

Depth and size of BSTs

Again, it is relatively simple to create recursive algorithms which determine the height or size (number of nodes) in a binary tree. (The two algorithms below work regardless of whether or not the tree also happens to be a search tree.)

To determine the size of a tree, we simply determine the size of the two subtrees (using recursive calls), add these together, then add one (for the current node).

size = 1
if the left child is not null,
   then make a recursive call to determine the size
        of the left subtree, and add this to size
if the right child is not null,
   then make a recursive call to determine the size
        of the right subtree, and add this to size
return the size
To determine the height of the tree, we determine which of the two subtrees has the greater height and add one to this value:
current height = 0
if the left child is not null,
   then make a recursive call to determine its height,
        and set the current height to this value
if the right child is not null,
   then make a recursive call to determine its height,
        and if this value is greater than the current height
        then update the current height
add one to the current height and return this value
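
Possible C++ versions of these two routines (sketches, using the same Node struct as before). Writing each with a null check at the top makes the empty-tree case explicit and keeps the code a little shorter than the pseudocode above:

#include <algorithm>   // for std::max

// Number of nodes in the tree rooted at n.
int size(Node* n) {
   if (n == nullptr) return 0;                   // an empty tree has size 0
   return 1 + size(n->left) + size(n->right);    // this node plus both subtrees
}

// Height of the tree rooted at n, counted as the number of levels
// (0 for an empty tree, 1 for a single node, and so on).
int height(Node* n) {
   if (n == nullptr) return 0;
   return 1 + std::max(height(n->left), height(n->right));
}
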
In terms of efficiency, the running time for each of these algorithms is proportional to the size of the tree.

Finding largest and smallest values

In our delete routines and tree reorganization routines we will want to be able to find the largest or smallest values in some subtrees, and again relatively simple recursive algorithms suffice:

Consider the following trees, looking for the largest and smallest values and for a pattern that determines where they appear:

      3  3         8              4
     /    \       / \            / \
    2      4     /   \          /   \
   /        \   6     10       2     6
  1          4   \   /        / \   / \
                  7 9        1   3 5   7
By noting that larger values (if any) must always be to the right, and smaller values (if any) must always be to the left, we arrive at the following algorithms:
FindLargest:
   if there is no right child then 
      return the current element
   otherwise call recursively on the right child

FindSmallest:
   if there is no left child then
      return the current element
   otherwise call recursively on the left child
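
In C++ (same Node struct, and assuming the tree passed in is not empty) these might look like:

// Largest value in a non-empty BST: keep following right children.
int findLargest(Node* n) {
   if (n->right == nullptr)
      return n->data;             // nothing larger exists to the right
   return findLargest(n->right);
}

// Smallest value in a non-empty BST: keep following left children.
int findSmallest(Node* n) {
   if (n->left == nullptr)
      return n->data;             // nothing smaller exists to the left
   return findSmallest(n->left);
}
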
Creating the mirror image of a tree

Suppose we decided that we wanted to reverse the structure of a tree - creating its mirror image.

Again, this is possible with a relatively simple recursive function:
- swap the current node's left subtree with its right subtree
- call recursively on the left child
- call recursively on the right child
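
A C++ sketch of the mirroring routine (same Node struct); std::swap exchanges the two child pointers, and the recursive calls then mirror each subtree:

#include <utility>   // for std::swap

// Replace the tree rooted at n with its mirror image.
void mirror(Node* n) {
   if (n == nullptr) return;
   std::swap(n->left, n->right);   // swap the left and right subtrees
   mirror(n->left);                // then mirror each subtree in turn
   mirror(n->right);
}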

Deleting elements from BSTs

Suppose we have a binary search tree, and we decide we want to remove a specific value: it is easy if the value is in a leaf node (just eliminate the node), but what do we do if it is not in a leaf?

Consider the tree below: what needs to happen if we want to delete the value 6?

         4
        / \
       /   \
      /     \
     3       6
            / \
           /   \
          4     8
           \   / \
            5 7   8
If we delete 6, we can't simply move one of its children up a level; we need to do something more advanced.

Observe that the smallest value in the right subtree of 6 will still be larger than every value in the left subtree of 6, and no larger than any other value in the right subtree of 6.

Suppose, then, that we find the smallest value in the right subtree, replace 6 with that value, and then delete the old copy of that value from the right subtree.

In the above example we would wind up with another valid binary search tree:

         4
        / \
       /   \
      /     \
     3       7
            / \
           /   \
          4     8
           \     \
            5     8
Note that if the right subtree of 6 had been empty, we could instead have simply replaced 6 with its left child.

This leads us to the following simple algorithm for deletion:

   - if the node has no children then simply delete it
   - otherwise, if the node has only one child then replace the deleted node
     with its child
   - otherwise, look in the right subtree of the node to be deleted and
        find the node with the smallest value; call it N.
     Replace the value in the node-to-be-deleted with the value from N,
        and make sure you remove N from its old location in the
        right subtree
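
A possible C++ version of the deletion algorithm (a sketch, reusing the Node struct and the findSmallest function from the earlier sketches). The zero-child and one-child cases are folded together here: a node with no children is simply "replaced" by a null pointer.

// Remove one occurrence of 'key' from the BST (a sketch).  The pointer is
// passed by reference so that the deleted node's parent link can be updated.
void removeValue(Node*& root, int key) {
   if (root == nullptr)
      return;                              // key is not in the tree
   if (key < root->data)
      removeValue(root->left, key);        // key would be in the left subtree
   else if (key > root->data)
      removeValue(root->right, key);       // key would be in the right subtree
   else if (root->left == nullptr || root->right == nullptr) {
      // zero or one child: replace this node with its only child (or nullptr)
      Node* child = (root->left != nullptr) ? root->left : root->right;
      delete root;
      root = child;
   } else {
      // two children: copy up the smallest value in the right subtree,
      // then remove that value from its old spot in the right subtree
      root->data = findSmallest(root->right);
      removeValue(root->right, root->data);
   }
}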