Hash tables are to searching what binary trees are to sorting. A hash table can reduce the time complexity of a search to O(1), i.e. constant time, at least in practical cases.
Simple Hash functions
Typical hash functions involve the modulo operator, because the remainder is always smaller than the divisor, so the result is guaranteed to be a valid slot index:
hash(item) = item%tableSize
If the table size is 11, this gives us the following hash table:
The load factor λ of the above table is the number of stored items divided by the table size: 3/11.
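The simple modulo hash and the load factor can be sketched in a few lines of Python. The items 54, 26 and 93 are illustrative placeholders, since the original table is not reproduced here:

```python
# Minimal sketch of a modulo hash table of size 11.
# The three items are assumed examples, not taken from the text.
def hash_item(item, table_size):
    return item % table_size

table_size = 11
table = [None] * table_size
for item in (54, 26, 93):
    table[hash_item(item, table_size)] = item

# Load factor λ = stored items / table size
load_factor = sum(slot is not None for slot in table) / table_size
print(load_factor)  # 3/11 ≈ 0.27
```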
A hash table stores items in the slots of a list, using a key to compute the slot index. If more than one item maps to the same slot, we get a collision.
Once the number of items exceeds the number of possible hash values, collisions become unavoidable, and performance degrades from O(1) towards, for example, O(n).
For example, in the above hash table, storing both 44 and 77 would yield the hash 0 for each, resulting in a collision. If the item collection never changes, you could construct a perfect hash function that assigns every item a unique slot, but the table would need to grow to accommodate such a function. In practice this is hardly ever done, because it would waste huge amounts of memory, even for numbers that are not especially long, like social security numbers. This is why we accept collisions and resolve them instead.
Because our simple modulo hash function is not injective, we need to come up with something better. There are many ways to deal with collisions, and I shall list a few of them here. In general, the hash function should remain simple and quick to compute, or it would defeat the purpose of avoiding a binary or sequential search in the first place.
Folding simply means splitting the number to be stored into pairs of digits, adding those pairs up, and taking the result modulo the table size.
E.g. a phone number: 436-555-4601
43 + 65 + 55 + 46 + 01 = 210; 210 % 11 = 1
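The folding method can be sketched as follows. Pairing is assumed to run left to right, with a lone trailing digit kept as-is:

```python
# Sketch of the folding method: split the number into digit pairs,
# sum the pairs, and take the result modulo the table size.
def fold_hash(number, table_size):
    digits = str(number)
    pairs = [int(digits[i:i + 2]) for i in range(0, len(digits), 2)]
    return sum(pairs) % table_size

fold_hash(4365554601, 11)  # 43 + 65 + 55 + 46 + 01 = 210; 210 % 11 = 1
```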
The mid-square method: square the number to be stored, extract the two middle digits (93 in the example below), and take the result modulo the table size.
44² = 1936
93 % 11 = 5
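A minimal sketch of the mid-square method, assuming we always take the two digits straddling the middle of the squared value:

```python
# Sketch of the mid-square method: square the item, extract the
# two middle digits, then apply the modulo.
def mid_square_hash(item, table_size):
    squared = str(item ** 2)                     # 44**2 -> "1936"
    mid = len(squared) // 2
    middle_two = int(squared[mid - 1:mid + 1])   # "93"
    return middle_two % table_size

mid_square_hash(44, 11)  # 93 % 11 = 5
```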
When a collision occurs, we can resolve it either by
- probing (closed hash table, one item per slot) or by
- chaining (open hash table, multiple items per slot)
Linear Probing / Open Addressing
When storing an item whose hash slot is already occupied, linear probing searches the table sequentially until it finds a free slot.
When searching for an item in the hash table, the hash value is computed and if the correct item is not in the expected slot, a sequential search is performed until it is either found or an empty slot occurs. In the latter case the algorithm would conclude that the searched item is not in the table.
Also, performance quickly degrades for table load factors above 1/2.
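Storing and searching with linear probing can be sketched like this. `None` is assumed to mark an empty slot, and the table is assumed never to be completely full (otherwise the probe loops would not terminate):

```python
# Sketch of linear probing: on a collision, step to the next slot
# (wrapping around) until a free slot is found.
def lp_store(table, item):
    slot = item % len(table)
    while table[slot] is not None:      # occupied: probe the next slot
        slot = (slot + 1) % len(table)
    table[slot] = item

def lp_search(table, item):
    slot = item % len(table)
    while table[slot] is not None:
        if table[slot] == item:
            return True
        slot = (slot + 1) % len(table)  # keep probing sequentially
    return False                        # empty slot: item is not stored

table = [None] * 11
for item in (44, 77, 55):               # all three hash to slot 0
    lp_store(table, item)
print(table[:3])  # [44, 77, 55]
```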
Linear probing leads to clustered tables: numerous neighbouring slots around a collision fill up. To mitigate this, a "plus 3" variant can be used, which stores and probes every third slot instead of the adjacent one.
Quadratic probing avoids clustering further by distributing items more evenly: when a collision occurs, the probe moves i² slots away from the colliding slot, where i is the number of attempts made to resolve the collision.
hash(item, i) = (item % tableSize + i²) % tableSize
- For λ ≤ ½, quadratic probing will find an empty slot
- For λ > ½, quadratic probing may find a slot (Wolfman, 2000).
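A sketch of quadratic probing under the same assumptions as the linear variant (fixed-size table, `None` marks an empty slot):

```python
# Sketch of quadratic probing: on the i-th attempt the probe lands
# i**2 slots past the home slot, wrapped around the table.
def qp_store(table, item):
    home = item % len(table)
    for i in range(len(table)):
        slot = (home + i * i) % len(table)
        if table[slot] is None:
            table[slot] = item
            return slot
    raise RuntimeError("no free slot found")  # may happen for λ > 1/2

table = [None] * 11
qp_store(table, 44)   # home slot 0 is free
qp_store(table, 77)   # slot 0 taken: 0 + 1**2 -> slot 1
qp_store(table, 55)   # slots 0 and 1 taken: 0 + 2**2 -> slot 4
```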
When faced with collisions, chaining stores multiple items in the same slot. Of course, when searching for an item, the time complexity with chaining increases as well, since the chain in a slot must be scanned sequentially.
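Chaining can be sketched with one Python list per slot; colliding items simply share a slot, and a lookup scans only that slot's chain:

```python
# Sketch of chaining (open hash table): each slot holds a list,
# so colliding items are appended to the same chain.
table = [[] for _ in range(11)]

def chain_store(table, item):
    table[item % len(table)].append(item)

def chain_search(table, item):
    return item in table[item % len(table)]  # linear scan of one chain

for item in (44, 77, 55):                    # all three hash to slot 0
    chain_store(table, item)
print(table[0])  # [44, 77, 55]
```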
Wolfman, S. (2000) CSE 326: Data Structures Lecture 15: When Keys Collide, Washington, The University of Washington [online]. Available at: https://courses.cs.washington.edu/courses/cse326/00wi/handouts/lecture16/sld022.htm (Accessed 3 June 2014).
Piwek, P., Walker, R. and Willis, A. (2013) Algorithms, data structures and computability: Hashing and hash tables, Milton Keynes, The Open University.
Miller, B. and Ranum, D. (2011) Problem Solving with Algorithms and Data Structures using Python, Sherwood, Franklin, Beedle & Associates.