I. Basic concepts
Searching means determining, within a lookup table, the data element (or record) whose keyword equals a given value.
Search Table: A collection of data elements (or records) of the same type.
Keyword (Key): the value of a data item in a data element, also known as the key value.
Primary Key: A keyword that uniquely identifies a data element or record.
Lookup tables are categorized according to their mode of operation:
- Static search table: a lookup table that only performs lookup operations. Its main operations are:
- Query whether a given data element is in the table.
- Retrieve a given data element and its attributes.
- Dynamic search table: a lookup table on which insertions and deletions are performed alongside lookups:
- Insert a data element during a lookup.
- Delete a data element during a lookup.
II. Unordered Table Lookup
That is, sequential search over unsorted data: traverse the data elements one by one.
Algorithm analysis: in the best case the key is found at the first position, which is O(1); in the worst case it is found at the last position, which is O(n); the average number of comparisons is therefore (n+1)/2, and the overall time complexity is O(n).
```python
# The most basic lookup algorithm: traverse an unordered list.
# Time complexity O(n)
def sequential_search(lis, key):
    length = len(lis)
    for i in range(length):
        if lis[i] == key:
            return i
    else:  # the loop finished without finding the key
        return False

if __name__ == '__main__':
    LIST = [1, 5, 8, 123, 22, 54, 7, 99, 300, 222]
    result = sequential_search(LIST, 123)
    print(result)
```
III. Ordered Table Lookup
The data in the lookup table must already be sorted on some key!
1. Binary Search
Algorithm core: repeatedly compare the middle element of the search range with the target value, halving the range each time.
```python
# Binary search algorithm for ordered lookup tables.
# Time complexity O(log(n))
def binary_search(lis, key):
    low = 0
    high = len(lis) - 1
    time = 0
    while low <= high:  # <= so that a key at the boundary can still be found
        time += 1
        mid = (low + high) // 2
        if key < lis[mid]:
            high = mid - 1
        elif key > lis[mid]:
            low = mid + 1
        else:
            # Print the number of times the range was halved
            print("times: %s" % time)
            return mid
    print("times: %s" % time)
    return False

if __name__ == '__main__':
    LIST = [1, 5, 7, 8, 22, 54, 99, 123, 200, 222, 444]
    result = binary_search(LIST, 99)
    print(result)
```
2. Interpolation search
Binary search is already quite good, but there is still room for optimization.
Sometimes eliminating half the data per step is not aggressive enough; wouldn't it be better to exclude nine-tenths of the data at each step? Choosing the split point is the key issue, and that is exactly what interpolation does: shrink the range at a faster rate.
The core of interpolation search is the formula:

value = (key - lis[low]) / (lis[high] - lis[low])

This ratio replaces the fixed 1/2 of binary search, biasing the split point toward where the key is expected to lie. The binary search code above can be reused with essentially a one-line change.
```python
# Interpolation search algorithm
# Time complexity O(log(n))
def interpolation_search(lis, key):
    low = 0
    high = len(lis) - 1
    time = 0
    while low <= high:
        time += 1
        if lis[low] == lis[high]:  # avoid division by zero
            mid = low
        else:
            # Computing mid this way is the core of the interpolation algorithm
            mid = low + int((high - low) * (key - lis[low]) / (lis[high] - lis[low]))
        if mid < low or mid > high:  # the key lies outside the current range
            break
        print("mid=%s, low=%s, high=%s" % (mid, low, high))
        if key < lis[mid]:
            high = mid - 1
        elif key > lis[mid]:
            low = mid + 1
        else:
            # Print the number of lookups
            print("times: %s" % time)
            return mid
    print("times: %s" % time)
    return False

if __name__ == '__main__':
    LIST = [1, 5, 7, 8, 22, 54, 99, 123, 200, 222, 444]
    result = interpolation_search(LIST, 444)
    print(result)
```
The overall time complexity of interpolation search is still O(log(n)). Its advantage: for lookup tables with a large amount of data and fairly uniformly distributed keywords, its average performance is much better than binary search. Conversely, for extremely unevenly distributed data, interpolation search is not a good fit.
3. Fibonacci search
Inspired by interpolation search, the Fibonacci search algorithm was devised. It too centers on optimizing the shrink rate so that the number of comparisons is as low as possible.
The algorithm assumes an already available list of Fibonacci numbers:
F = [1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, ...]
```python
# Fibonacci search algorithm
# Time complexity O(log(n))
def fibonacci_search(lis, key):
    # A ready-made Fibonacci list is required; its largest element
    # must exceed the number of elements in the lookup table.
    F = [1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377, 610, 987,
         1597, 2584, 4181, 6765, 10946, 17711, 28657, 46368]
    low = 0
    high = len(lis) - 1
    # To make the lookup table satisfy the Fibonacci property, append a few
    # copies of the table's last element at the end.
    # The number of copies is determined by F[k]-1-high.
    k = 0
    while high > F[k] - 1:
        k += 1
    print(k)
    i = high
    while F[k] - 1 > i:
        lis.append(lis[high])
        i += 1
    print(lis)
    # Main loop of the algorithm. time shows the number of iterations.
    time = 0
    while low <= high:
        time += 1
        # Guard against the F-list subscript going out of range
        if k < 2:
            mid = low
        else:
            mid = low + F[k-1] - 1
        print("low=%s, mid=%s, high=%s" % (low, mid, high))
        if key < lis[mid]:
            high = mid - 1
            k -= 1
        elif key > lis[mid]:
            low = mid + 1
            k -= 2
        else:
            if mid <= high:
                # Print the number of lookups
                print("times: %s" % time)
                return mid
            else:
                # mid points into the padded tail; the real match is at high
                print("times: %s" % time)
                return high
    print("times: %s" % time)
    return False

if __name__ == '__main__':
    LIST = [1, 5, 7, 8, 22, 54, 99, 123, 200, 222, 444]
    result = fibonacci_search(LIST, 444)
    print(result)
```
Algorithm analysis: the overall time complexity of Fibonacci search is also O(log(n)). Its average performance is better than binary search, but in the worst case (for example, when the key is 1, so the search always stays in the left part of the range) it is less efficient than binary search.
To summarize: binary search computes mid with addition and division, interpolation search with more complex arithmetic including multiplication and division, and Fibonacci search with only simple addition and subtraction. When searching massive data sets, these subtle differences can affect the final efficiency. The three ordered-table lookup methods differ, in essence, only in their choice of split point; each has its strengths and weaknesses, and the choice should fit the actual situation.
IV. Linear Index Lookup
For massive unordered data, an index table is generally built for it to improve lookup speed.
Indexing is the process of associating a keyword with the record that corresponds to it.
An index comprises a number of index entries, each of which contains at least a keyword and the location of the corresponding record in storage.
Indexes can be categorized according to their structure: linear indexes, tree indexes and multi-level indexes.
Linear index: the collection of index entries is organized as a linear structure, also called an index table.
Linear indexes can be categorized into dense indexes, chunked indexes and inverted indexes.
1. Dense index
A dense index is a linear index in which one index entry is created for every record in the data collection.
This is effectively building an ordered linear table over an unordered collection: the index entries themselves must be sorted by keyword.
In other words, the sorting that the lookup process requires is done up front, when the index is built. A minimal sketch follows.
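Below is a minimal sketch of a dense index in Python, assuming toy records of the form (keyword, payload); the names `records`, `index` and `dense_search` are illustrative, not from the original text. The ordered index stores each record's position and is searched by bisection.

```python
import bisect

# Unordered records; each is (keyword, payload).
records = [(54, "rec-a"), (7, "rec-b"), (99, "rec-c"), (1, "rec-d")]

# Dense index: one (keyword, position) entry per record, sorted by keyword.
index = sorted((key, pos) for pos, (key, _) in enumerate(records))
keys = [key for key, _ in index]

def dense_search(key):
    """Bisect the ordered index, then fetch the record at the stored position."""
    i = bisect.bisect_left(keys, key)
    if i < len(keys) and keys[i] == key:
        return records[index[i][1]]
    return None

print(dense_search(99))  # (99, 'rec-c')
print(dense_search(2))   # None
```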
2. Chunked index
Chunking divides a large collection of unordered data into blocks such that the data are unordered within each block but ordered between blocks.
This is an intermediate state, a compromise between ordered and unordered lookup. When the amount of data is huge, building a complete dense index is time-consuming and resource-hungry; but doing no sorting or indexing at all leaves an unacceptable full traversal. The compromise is to sort and index to a limited degree.
Chunked indexing is somewhat more efficient than the O(n) of plain traversal, but still falls well short of the O(log n) of binary search. A sketch follows.
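A minimal sketch of a chunked index; the sample blocks and the `block_search` helper are illustrative. The index over the blocks (each block's largest key) is searched by bisection, then the target block is scanned sequentially. With m blocks of roughly n/m elements each, the cost is about O(log m + n/m).

```python
import bisect

# Ordered between blocks (every key in a block is smaller than every key
# in the next block), unordered within each block.
blocks = [[8, 1, 5, 7], [22, 54, 37, 29], [99, 222, 123, 200]]
max_keys = [max(b) for b in blocks]  # block index: the largest key per block

def block_search(key):
    """Bisect over the block index, then scan inside the chosen block."""
    i = bisect.bisect_left(max_keys, key)
    if i == len(blocks):
        return None
    for j, item in enumerate(blocks[i]):
        if item == key:
            return (i, j)  # (block number, position within the block)
    return None

print(block_search(54))  # (1, 1)
print(block_search(2))   # None
```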
3. Inverted index
Here the attribute value determines the location of the record, rather than the record determining the attribute value; this is known as an inverted index. The record-number table stores the addresses or references of all records that share the same secondary keyword (either a pointer to the record or the record's primary keyword).
Inverted indexing is the most basic search engine indexing technique.
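A minimal sketch of an inverted index over two toy documents; `docs` and `inverted` are illustrative names. Each word (the secondary keyword) maps to the list of document numbers that contain it, so a query never has to scan the documents themselves.

```python
docs = {
    1: "books and friends should be few but good",
    2: "a good book is a good friend",
}

# Build the inverted index: word -> list of document numbers.
inverted = {}
for doc_id, text in docs.items():
    for word in set(text.split()):
        inverted.setdefault(word, []).append(doc_id)

print(sorted(inverted["good"]))  # [1, 2]: "good" occurs in both documents
print(sorted(inverted["book"]))  # [2]
```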
V. Binary sorted trees
A binary sorted tree is also known as a binary search tree. It is either an empty tree or a binary tree with the following properties:
- If its left subtree is not empty, the values of all nodes in the left subtree are less than the value of its root node;
- If its right subtree is not empty, the values of all nodes in the right subtree are greater than the value of its root node;
- Its left and right subtrees are also binary sorted trees respectively.
The purpose of constructing a binary sorted tree is usually not sorting itself, but speeding up the search, insertion and deletion of keywords.
Operations on binary sorted trees:
1. Search: compare the keyword with the node's value. If they are equal, the node is found; if the keyword is smaller, continue in the node's left subtree; if larger, continue in the right subtree. Recurse in this way and finally return a Boolean value or the found node.
2. Insert: starting from the root, compare the keyword with each node, going left when smaller and right when larger; on reaching an empty subtree, link the new node in at that spot.
3. Delete: if the node to be deleted is a leaf, delete it directly; if it has only a left subtree or only a right subtree, delete the node and link its subtree to its parent; if it has both left and right subtrees, replace the deleted node with its in-order predecessor (or successor) node, as found by an in-order traversal.
```python
#!/usr/bin/env python
# -*- coding:utf-8 -*-
# Author: Liu Jiang
# Python 3.5


class BSTNode:
    """
    A binary tree node class. Focuses on the algorithm and ignores
    issues such as validating data types.
    """
    def __init__(self, data, left=None, right=None):
        """
        Initialization
        :param data: data stored by the node
        :param left: left subtree of the node
        :param right: right subtree of the node
        """
        self.data = data
        self.left = left
        self.right = right


class BinarySortTree:
    """
    A binary sorted tree based on the BSTNode class.
    Maintains a pointer to the root node.
    """
    def __init__(self):
        self._root = None

    def is_empty(self):
        return self._root is None

    def search(self, key):
        """
        Key retrieval
        :param key: key code
        :return: the queried node data or None
        """
        bt = self._root
        while bt:
            entry = bt.data
            if key < entry:
                bt = bt.left
            elif key > entry:
                bt = bt.right
            else:
                return entry
        return None

    def insert(self, key):
        """
        Insertion operation
        :param key: key code
        :return: None
        """
        bt = self._root
        if not bt:
            self._root = BSTNode(key)
            return
        while True:
            entry = bt.data
            if key < entry:
                if bt.left is None:
                    bt.left = BSTNode(key)
                    return
                bt = bt.left
            elif key > entry:
                if bt.right is None:
                    bt.right = BSTNode(key)
                    return
                bt = bt.right
            else:
                bt.data = key
                return

    def delete(self, key):
        """
        The most complex method for binary sorted trees
        :param key: key code
        :return: None
        """
        p, q = None, self._root  # maintain p as the parent of q for later relinking
        if not q:
            print("Empty tree!")
            return
        while q and q.data != key:
            p = q
            if key < q.data:
                q = q.left
            else:
                q = q.right
        if not q:  # exit: there is no node with key code key in the tree
            return
        # The node to be deleted has been found and is referenced by q.
        # p is the parent of q, or None when q is the root node.
        if not q.left:
            if p is None:
                self._root = q.right
            elif q is p.left:
                p.left = q.right
            else:
                p.right = q.right
            return
        # Find the rightmost node of q's left subtree and link q's right
        # subtree as the right subtree of that node.
        # This may increase the depth of the tree and is not very efficient;
        # other strategies could be designed.
        r = q.left
        while r.right:
            r = r.right
        r.right = q.right
        if p is None:
            self._root = q.left
        elif p.left is q:
            p.left = q.left
        else:
            p.right = q.left

    def __iter__(self):
        """
        In-order traversal of the binary tree, to display the binary
        sorted tree we created. Uses python's built-in list as a stack.
        :return: data
        """
        stack = []
        node = self._root
        while node or stack:
            while node:
                stack.append(node)
                node = node.left
            node = stack.pop()
            yield node.data
            node = node.right


if __name__ == '__main__':
    lis = [62, 58, 88, 48, 73, 99, 35, 51, 93, 29, 37, 49, 56, 36, 50]
    bs_tree = BinarySortTree()
    for i in range(len(lis)):
        bs_tree.insert(lis[i])
    # bs_tree.insert(100)
    bs_tree.delete(58)
    for i in bs_tree:
        print(i, end=" ")
    # print("\n", bs_tree.search(4))
```
Summary of binary sorted trees:
- Binary sorted trees are stored as linked structures, keeping the advantages of linked structures for insertion and deletion.
- In the best case a lookup needs just one comparison, and the number of comparisons never exceeds the depth of the tree. In other words, the lookup performance of a binary sorted tree depends on its shape, which leads to the balanced binary tree discussed next.
- Given a collection of elements, many different binary sorted trees can be constructed from it; when the result happens to be a complete binary tree, lookup time complexity is O(log(n)), approximating binary search.
- In the most extreme case the tree degenerates into a slanted chain; lookup then takes O(n), equivalent to sequential search, which is the least efficient.
VI. Balanced binary trees
Balanced binary tree (AVL tree, after its inventors' initials): a highly balanced binary sorted tree in which the heights of each node's left and right subtrees differ by at most one.
A balanced binary tree must first be a binary sorted tree!
Balance Factor (BF): the depth of a node's left subtree minus the depth of its right subtree.
In a balanced binary tree, the balance factor of every node, branch and leaf alike, can only be -1, 0 or 1. As soon as any node's factor falls outside these three values, the tree is unbalanced.
Minimum unbalanced subtree: the subtree rooted at the node nearest the newly inserted node whose balance factor has absolute value greater than 1.
Idea behind constructing a balanced binary tree: whenever a new node is inserted, check whether the tree's balance has been broken; if so, find the minimum unbalanced subtree. Then, while preserving the binary-sorted-tree property, adjust the links among the nodes of that subtree with the appropriate rotation, turning it into a new balanced subtree.
For example, inserting [1,2,3,4,5,6,7,10,9] one by one, rebalancing after each insertion, produces a balanced binary tree. A small sketch for checking balance follows.
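As a minimal sketch (not part of the construction algorithm itself), the balance factor defined above can be computed recursively to test whether a tree is balanced; this reuses the BSTNode class from the previous section, and the `depth`/`is_balanced` helpers are illustrative.

```python
def depth(node):
    """Depth of the subtree rooted at node; 0 for an empty tree."""
    if node is None:
        return 0
    return 1 + max(depth(node.left), depth(node.right))

def is_balanced(node):
    """True if every node's balance factor (left depth minus right depth)
    is -1, 0 or 1."""
    if node is None:
        return True
    bf = depth(node.left) - depth(node.right)
    return abs(bf) <= 1 and is_balanced(node.left) and is_balanced(node.right)

# A chain 1 -> 2 -> 3 built only of right children is unbalanced:
print(is_balanced(BSTNode(1, right=BSTNode(2, right=BSTNode(3)))))  # False
# The same keys arranged with 2 at the root are balanced:
print(is_balanced(BSTNode(2, BSTNode(1), BSTNode(3))))              # True
```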
VII. Multi-way search trees (B-trees)
Multi-way search tree: each node can have more than two children, and multiple elements can be stored at each node.
For a multi-way search tree, how many elements each node may store and how many children it may have are the key questions. Four forms are in common use: the 2-3 tree, the 2-3-4 tree, the B-tree and the B+ tree.
2-3 trees
2-3 tree: each node has 2 children, or 3 children, or no children.
A 2-node contains one element and has two children or no children, never just one. As in a binary sorted tree, its left subtree contains all elements smaller than that element, and its right subtree contains all elements larger.
A 3-node contains two elements and has three children or no children, never just one or two.
All leaves of a 2-3 tree must be on the same level.
Insertion always happens at a leaf: if the leaf has room, the element is simply added; if the leaf overflows, it is split and its middle element is promoted to the parent, splitting upward as far as necessary (possibly adding a new root level).
Deletion, conversely, redistributes or merges elements among siblings and the parent so that all leaves remain on the same level.
2-3-4 trees
It is an extension of the 2-3 tree that also allows 4-nodes. A 4-node contains three elements (small, middle and large) and has four children or no children.
Its insertion and deletion operations work in the same spirit as those of the 2-3 tree; the 4-node simply gives a node extra room before a split or merge is needed.
B-tree
A B-tree is a balanced multi-way search tree. The maximum number of children of a node is called the order of the B-tree. A 2-3 tree is a B-tree of order 3, and a 2-3-4 tree is a B-tree of order 4.
The B-tree data structure is mainly used for data interaction between memory and external storage.
B+ Tree
To address basic problems such as traversing all the elements of a B-tree, the B+ tree adds a new way of organizing elements on top of the original structure.
The B+ tree arose from the needs of file systems; strictly speaking, it is no longer a tree in the most basic sense.
Every element that appears in a branch node of a B+ tree is listed again at that branch position's in-order successor, which is a leaf node. In addition, each leaf node keeps a pointer to the next leaf node.
The leaf nodes together contain all the keywords along with pointers to the corresponding records, and the leaf nodes themselves are linked in ascending keyword order.
The structure of a B+ tree is particularly well suited to range lookups, for example finding everyone between the ages of 20 and 30. A sketch of the leaf-level idea follows.
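To illustrate why, here is a minimal sketch of just the leaf level of a B+ tree; the `Leaf` class and the sample ages are invented for illustration. Once the starting leaf is reached, a range query only follows the leaf chain and never re-descends the tree.

```python
class Leaf:
    """A toy B+ tree leaf: sorted keys plus a pointer to the next leaf."""
    def __init__(self, keys, nxt=None):
        self.keys = keys
        self.next = nxt

# Leaves linked in ascending key order (ages, say).
l3 = Leaf([28, 30, 35])
l2 = Leaf([20, 22, 25], l3)
l1 = Leaf([15, 17, 19], l2)

def range_query(leaf, lo, hi):
    """Walk the leaf chain, collecting every key in [lo, hi]."""
    result = []
    while leaf:
        for k in leaf.keys:
            if k > hi:
                return result
            if k >= lo:
                result.append(k)
        leaf = leaf.next
    return result

print(range_query(l1, 20, 30))  # [20, 22, 25, 28, 30]
```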
VIII. Hash tables
Hash table: the elements need not bear any relationship to one another. The storage location of an element is computed directly from the element's keyword by some function. This one-to-one mapping function is called the hash function.
With hashing, records are stored in a contiguous area of storage called a hash table (Hash Table). The storage location computed for a keyword is called its hash address.
A hash table is a lookup-oriented storage structure. It is best suited to answering whether a record equal to a given value exists. It is not suitable when one keyword corresponds to many records, such as finding all "men"; nor for range searches, such as finding everyone between the ages of 20 and 30; sorting, max, min and the like are also inappropriate.
Hash tables are therefore usually used for data structures whose keywords do not repeat, such as Python's dictionary type.
Designing a simple, uniform hash function with high storage utilization is the most critical problem in hashing.
However, hash functions in general face the problem of collisions.
Collision: two different keywords produce the same result when run through the hash function.
8.1 Construction methods for hash functions
A good hash function: easy to compute, with hash addresses distributed evenly.
1. Direct addressing method
For example, take some linear function of the keyword as the hash function:
f(key) = a*key + b (a,b are constants)
2. Digit analysis method
Extract digits from the keyword and assign addresses according to the characteristics of those digits.
3. Mid-square method
Square the keyword, then take the middle digits of the result.
4. Folding method
A way of playing with digits: split the keyword's digits into parts, compute on the parts separately, then combine them. A small sketch follows.
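A minimal sketch of one possible folding scheme; the grouping width and table length are arbitrary choices for illustration, not a fixed standard:

```python
def fold_hash(key, width=3, m=1000):
    """Split the key's digits into groups of `width`, sum the groups,
    and keep the result within a table of length m."""
    s = str(key)
    parts = [s[i:i + width] for i in range(0, len(s), width)]
    return sum(int(p) for p in parts) % m

print(fold_hash(9876543210))  # 987 + 654 + 321 + 0 = 1962 -> 962
```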
5. Division-remainder method
One of the most common methods.
For a collection of data with table length m, the hash formula is:
f(key) = key mod p (p<=m)
mod: taking the modulus (finding the remainder)
The most critical point of this method is the choice of p; when the amount of data is large, collisions are inevitable, and a prime number close to m is usually chosen. For example:
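A tiny demonstration, borrowing the table length m = 12 and some sample keys from the code in section 8.3 below, with p = 11 as a nearby prime:

```python
m, p = 12, 11
for key in (12, 67, 56, 16, 25):
    print(key, "->", key % p)
# 12, 67 and 56 all map to address 1: collisions that the
# techniques of the next section must resolve.
```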
6. Random number method
Choose a random function and take its value on the keyword as the hash address.
f(key) = random(key)
To summarize: in practice, different hashing methods are chosen according to the characteristics of the data, weighing the main issues below:
- Time required to calculate the hash address
- Length of keywords
- Size of the hash table
- Distribution of keywords
- Frequency of record searches
8.2 Handling hash conflicts
1. Open addressing method
As soon as a collision occurs, look for the next empty hash address; as long as the hash table is large enough, an empty address will eventually be found and the record stored there.
The formula for probing is:

f(key) = (f(key) + d) mod m   (d = 1, 2, 3, ..., m-1)

This simplest form of collision resolution is known as linear probing: just try the following slots one after another until a free one turns up, regardless of whether that slot is really "home" to some later key.
The biggest problem with linear probing is the accumulation of collisions: you occupy the slot that another key should hash to, so that key in turn has to go hunting for a slot, just as you did.
Improvements include quadratic probing and random-number probing. A sketch of quadratic probing follows.
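A minimal sketch of a simplified, one-sided quadratic probe; textbook versions usually alternate +i² and -i², and this simplified variant can skip some slots, so it is only meant to show the idea. The helper name and sample table are illustrative:

```python
def quadratic_probe(hash_addr, m, elem):
    """Probe hash_addr + 1, 4, 9, ... (mod m) until an empty slot
    is found; return the slot index, or None if none was found."""
    for i in range(1, m):
        address = (hash_addr + i * i) % m
        if elem[address] is None:
            return address
    return None

# With slot 0 occupied and a key hashing to 0, the first probe (0 + 1*1)
# already lands on a free slot:
table = [5] + [None] * 11
print(quadratic_probe(0, 12, table))  # 1
```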
2. Re-hash function method
When a collision occurs, the address is computed with another hash function; one of them will eventually resolve the collision. This keeps keywords from clustering, but correspondingly increases computation time.
3. Chaining (link address) method
When a collision occurs, instead of looking for another address, all records whose keywords are synonyms are stored in one linked list, and the hash table stores only the head pointers of these synonym sublists.
The advantage is that collisions are no longer a worry; the disadvantage is that it reduces the random-access performance of the hash structure. In essence, a linked-list structure is brought in to compensate for the weakness of the hash structure. A sketch follows.
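A minimal sketch of chaining, using Python lists in place of hand-rolled linked lists; `ChainedHashTable` and the sample keys are illustrative:

```python
class ChainedHashTable:
    def __init__(self, size):
        self.size = size
        self.slots = [[] for _ in range(size)]  # one chain per address

    def hash(self, key):
        return key % self.size  # division-remainder method

    def insert(self, key):
        self.slots[self.hash(key)].append(key)  # synonyms share one chain

    def search(self, key):
        return key in self.slots[self.hash(key)]  # scan only that chain

table = ChainedHashTable(12)
for k in (12, 67, 56, 16, 25):
    table.insert(k)
print(table.search(67), table.search(99))  # True False
```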
4. Public overflow area method
In effect, a separate area of storage is set aside to hold all the colliding records. This method is quite appropriate when the colliding data are very few relative to the base table.
8.3 Hash table lookup implementation
Here's a simple piece of code that implements it:
```python
#!/usr/bin/env python
# -*- coding:utf-8 -*-
# Author: Liu Jiang
# Python 3.5
# Ignores validation of data types, element overflow and the like.


class HashTable:
    def __init__(self, size):
        self.elem = [None for i in range(size)]  # use a list to hold the hash table elements
        self.count = size  # maximum table length

    def hash(self, key):
        return key % self.count  # the hash function uses the division-remainder method

    def insert_hash(self, key):
        """Insert a keyword into the hash table."""
        address = self.hash(key)  # compute the hash address
        while self.elem[address]:  # the slot already holds data: a collision occurred
            address = (address + 1) % self.count  # linear probing: try the next address
        self.elem[address] = key  # store directly when there is no collision

    def search_hash(self, key):
        """Look up a keyword; return a Boolean."""
        star = address = self.hash(key)
        while self.elem[address] != key:
            address = (address + 1) % self.count
            if not self.elem[address] or address == star:  # not found, or looped back to the start
                return False
        return True


if __name__ == '__main__':
    list_a = [12, 67, 56, 16, 25, 37, 22, 29, 15, 47, 48, 34]
    hash_table = HashTable(12)
    for i in list_a:
        hash_table.insert_hash(i)

    for i in hash_table.elem:
        if i:
            print((i, hash_table.elem.index(i)), end=" ")
    print("\n")

    print(hash_table.search_hash(15))
    print(hash_table.search_hash(33))
```
8.4 Hash table lookup performance analysis
If no collisions occur, lookup time complexity is O(1), the best possible among all the lookup methods discussed.
In reality, however, collisions are unavoidable, and the following three aspects have a large impact on lookup performance:
- whether the hash function is uniform;
- the method used to handle collisions;
- the fill factor of the hash table (the extent to which the table is filled with data).
This is the whole content of this article.