15.8 Hash Tables
A hash table is a data-structure that uses a hash function to compute an index into an array. The most common form of hash table, and the easiest to explain, is one that combines an array of pointers with a set of linked-lists. The basic form of this type of hash table is shown in Figure 15.5. Each pointer in the array of pointers points to the head of a singly linked-list. A list may be empty, in which case the pointer is NULL. Each element in the array is called a “bucket”, and the list pointed to by a bucket is called a “chain”.Figure 15.5: Hash table. An array of pointers (buckets) point to singly linked-lists (chains). A hash function converts an item value into a bucket index, which restricts search to the attached chain. Each chain may be zero length or greater.
The operation of a hash table is as follows. Given an item of data to be stored in the table, the hash function computes an index based on the value of this item. The index is such that it falls within the bounds of the pointer array, and so specifies one of the buckets. The selected bucket points to a linked-list, which is searched to check whether the item is already stored, otherwise the item is added to the front of the list.
The key to an e cient hash table is a good hash function. Essentially it should distribute items evenly between the di erent buckets and not favour any particular bucket over another. The derivation of hash functions involves fairly advanced mathematics including aspects of probability theory and prime number theory. Also, the length of the array of pointers should not be arbitrary but itself a prime number. We will not discuss these issues further as they are beyond the scope of this text.
A hash table is useful for implementing very fast lookup tables. The hash function computes a bucket index in O(1) time and, assuming the chain is short, the linear link-list search is very quick. Provided the hash function distributes items evenly, the chains will be short enough so that the entire operation may be considered O(1). Thus, on average, hash tables permit O(1) lookup, although in the worst case, where the hash function places all items in a single bucket, lookup can be O(n).
In the following example we use a hash table to implement a dictionary. The dictionary is built up of words and their associated definitions. These word-definition pairs are stored in the hash table using the word as a search key. The hash table permits fast insertion, search and deletion, and the public interface for this module is shown below.
1 typedef struct Dictionary t Dictionary;
2
3 Dictionary *create table(void);
4 void destroy table(Dictionary *);
5
6 int add word(Dictionary *, const char *key, const char *defn); 7 char *find word(const Dictionary *, const char *key);
8 void delete word(Dictionary *, const char *key);
The next section of code shows part of the private interface. The #define (line 1) specifies the size of the array of pointers (i.e., the number of buckets). Notice this value is prime. Lines 3 to 7
define the link-list node type. Each node in a chain contains a word, its definition, and a pointer to the next node. The Dictionary type (lines 9 to 11) contains an array of pointers to the chain nodes; this is the array of buckets. The hash function (lines 13 to 22) is a complicated device, and we will not elaborate on its workings here. It was obtained from [KR88, page 144], and to quote this text, it “is not the best possible hash function, but it is short and e ective.” Notice that it takes a string argument and converts it to a bucket index.
1 #define HASHSIZE 101
2
3 struct Nlist {
4 char *word; /* search word */
5 char *defn; /* word definition */
6 struct Nlist *next; /* pointer to next entry in chain */
7 };
8
9 struct Dictionary t {
10 struct Nlist *table[HASHSIZE]; /* table is an array of pointers to entries */
11 };
12
13 static unsigned hash function(const char *str)
14 /* Hashing function converts a string to an index within hash table. */
15 {
16 const int HashValue = 31;
17 unsigned h;
18
19 for (h = 0; *str != ’\0’; ++str)
20 h = *str + HashValue * h;
21 return h % HASHSIZE;
22 }
The two functions that follow demonstrate the main workings of the hash table algorithm. The first, and most instructive, is add_word(), which takes a word-definition pair and adds it to the table. If the word is already stored, the old definition is replaced with the new one, otherwise, if the word is not found, a new table entry is added. The second function, find_word(), uses the same search mechanism as add_word() to determine if a word is stored in the table and, if so, returns the associated definition.
1 int add word(Dictionary *dict, const char *key, const char *defn)
2 /* Add new word to table. Replaces old definition if word already exists. 3 * Return 0 if successful, and -1 is fails. */
4 {
5unsigned i = hash function(key); /* get table index */
6struct Nlist *pnode = dict->table[i];
7
8while (pnode && strcmp(pnode->word, key) != 0) /* search chain */
9 pnode = pnode->next;
10
11 if (pnode) { /* match found, replace definition */
12 char *str = allocate string(defn);
13 if (str == NULL) /* allocation fails, return fail and keep old defn */
14 return -1;
15 free(pnode->defn);
16
17 pnode->defn = str;
18 }
19 else { /* no match, add new entry to head of chain */
20 pnode = makenode(key, defn);
21 if (pnode == NULL)
22 return -1;
23 pnode->next = dict->table[i];
24
25 dict->table[i] = pnode;
26 }
27 return 0;
28 }
29
30 char *find word(const Dictionary *dict, const char *key)
31 /* Find definition for keyword. Return NULL if key not found. */
32 {
33 unsigned i = hash function(key); /* get table index */
34 struct Nlist *pnode = dict->table[i];
35 while (pnode && strcmp(pnode->word, key) != 0) /* search index chain */
36
37 pnode = pnode->next;
38
39 if (pnode) /* match found */
40 return pnode->defn;
41 return NULL;
42 }
5–6 The word is passed to the hash function, which computes an array index. This bucket, in turn, points to the head of a node chain.
8–9 We search the length of the chain looking for a string match between the keyword and a node word.
11–18 If the node pointer is not NULL then a match was found before the end of the chain. Thus, we replace the old definition with the new one. (Note, the function allocate_string() is part of the module’s private interface and makes a duplicate of the passed string.)
19–26 If the end of the chain was reached, then the keyword is new and is added to the head of the chain. (Note, the function makenode() is part of the module’s private interface; it creates and initialises a Nlist node.)
33–37 This code is identical to lines 5 to 9 in add_word().
39–41 If the keyword is found, return a pointer to its definition, otherwise return NULL.
The above example demonstrates a specific hash table implementation for storing a specific type of data (i.e., strings). Writing a generic version of a hash table is not trivial because di erent data types typically require di erent hash functions. However, a generic implementation is possible by using a function pointer to permit user-supplied hash functions. (A default hash function might be called if the user passes NULL.)
As always, deciding whether a hash table is the right data-structure for a particular problem is a matter of considering tradeo s. Hash tables provide very fast O(1) add, delete, and find operations on average, if supplied with an e ective hash function. However, they can have bad O(n) worst-case behaviour, and in some circumstances the O(log n) worst-case complexity of a balanced tree (e.g., a red-black tree) might be preferred.
No comments:
Post a Comment