Hash Tables
A hash table uses the key of each record to determine the location in an array structure. To do this, the key is passed into a hash function which will then return a numeric value based on the key.
Hash Functions
A hash function must be designed so that given a certain key it will always return the same numeric value. Furthermore, the hash function will ideally distribute all possible keys in the keyspace uniformly over all possible locations.
For example suppose that we wanted to create a table for storing customer information at store. For the key, a customer's telephone number is used. The table can hold up to 10,000 records and thus valid indexes for an array of that size would be [0 - 9999]
Telephone numbers are 10 digits (###) ###-####
The first 3 of which is an area code.
Now, if your hash function was: use the first 4 digits of the phone number (area code + first digit of number) that hash function would not be very good because most people in the same area would have the same area code. Most people in the Toronto for example have area code of 416 or 647... so there would be very little variation in the records. However the last 4 digits of a phone number is much more likely to be different between users (though certainly not unique).
Generally speaking a good hash function should be:
- uniform (all indices are equally likely for the given set of possible keys)
- random (not predictable)
Load Factor
The load factor denoted by the symbol (pronounced lambda) measures the fullness of the hash table. It is calculated by the formula:
Collisions
The pigeon hole principle
Suppose you had n mailboxes and m letters where m > n (more letters than mailboxes). If you were to place all the letters into the available mailboxes, there would be at least one mailbox with at least 2 letters in it. This is the pigeon hole principle.
This is effectively what the situation is with our hash function and keys. The number of mailboxes we have is n (capacity of array). The total number of possible keys is m. Typically, the total number of possible keys is bigger than the capacity of the array. (usually significantly bigger). Based on the pigeon hole principle, it means that we will have situations where at two keys will get hashed into the same hash index.
When two keys have the same hash index you have a collision. Generally speaking, collisions are unavoidable. The key is to have a method of resolving them when they do happen. The rest of this chapter look at different ways to deal with collisions.