The Story That Explains Hash Tables
A smart cinema pre-computes each person's seat from their ticket number. Your ticket number is 7421? Seat = 7421 % 100 = 21. The usher walks directly to row 21 — no scanning, no shouting.
That computation (ticket % capacity) is the hash function. The numbered rows are the buckets. The whole cinema is a Hash Table — built for instant, direct access.
A Hash Table (Hash Map) maps keys to values using a hash
function to compute an array index directly. Average-case insert, delete, and lookup are all
O(1) — constant time regardless of how many items are stored.
Python's dict and set are hash tables under the hood.
Arrays give O(1) access by position. Hash tables give O(1) access by any key — a string, an integer, a tuple. They're the go-to structure any time you need to count, group, cache, or deduplicate data fast.
How a Hash Table Works — The 4 Phases
hash()).
This converts any key — string, int, tuple — into a large integer.
Same key always produces the same hash. Different keys usually produce different hashes.
When two different keys produce the same index, that's a collision. The chaining strategy stores both as a list in that bucket and scans it on lookup. Python's dict avoids this entirely for most workloads using a high-quality hash function and rehashing at ~⅔ load factor.
Collision Strategies Side-by-Side
| Each bucket | holds a linked list |
| Collision | append to list — O(1) |
| Lookup | scan list — O(L) where L = chain length |
| Memory | overhead per node (pointer + data) |
| Used by | Java HashMap, most textbook impls |
| Each bucket | holds at most one entry |
| Collision | probe next slot — O(1) avg |
| Lookup | probe sequence — O(1) avg |
| Memory | compact flat array, no pointers |
| Used by | Python dict, C++ unordered_map |
Load Factor & Rehashing
Load factor α = n / capacity — the fraction of buckets in use. As α grows, collision probability rises and average lookup time degrades. Python's dict rehashes (doubles capacity and reinserts all entries) when α exceeds roughly ⅔ (0.67).
In latency-sensitive code, if you know you'll insert N items, pre-populate with
d = dict.fromkeys(known_keys) or insert a placeholder batch upfront.
This sidesteps the rehash spike during a hot path, since Python won't resize until
the ⅔ threshold is hit again on the already-enlarged array.
Build a Hash Map from Scratch
We implement a minimal hash map using separate chaining.
Each bucket holds a list of (key, value) pairs.
Step through the animation to watch each line execute during put("name","Alice").
The list inside each bucket is the chain. When two keys hash to the same index, both (key, value) pairs coexist in that bucket's list. On lookup we scan only that short list — usually length 0 or 1 — which is effectively O(1) in practice.
Python's dict, defaultdict & Counter
# ── 1. Core dict operations ───────────────────────────────
d = {} # empty hash map — O(1)
d["name"] = "Alice" # insert / update — O(1) avg
d["age"] = 30
print(d.get("name", "N/A")) # safe lookup — no KeyError
print("age" in d) # membership — O(1)
del d["age"] # delete — O(1) avg
d.setdefault("score", 0) # insert only if missing
# ── 2. defaultdict — auto-creates missing keys ────────────
from collections import defaultdict
groups = defaultdict(list)
words = ["apple", "ant", "banana", "bear", "cat"]
for w in words:
groups[w[0]].append(w) # group by first letter — no KeyError
print(dict(groups))
# ── 3. Counter — frequency hash map ──────────────────────
from collections import Counter
text = "hello world"
freq = Counter(text) # O(n) build
print(freq.most_common(3)) # top 3 items — O(n log k)
print(freq["l"]) # O(1) count lookup
# ── Counter arithmetic ────────────────────────────────────
a = Counter(["a","b","a"])
b = Counter(["a","c"])
print(a + b) # combine counts
print(a - b) # subtract counts (drops zeros/negatives)
d[key] raises KeyError if missing.
d.get(key, default) returns the default instead —
zero overhead, zero crashes. In interview code and production code alike, .get()
is the professional choice.
Time & Space Complexity Reference
| Operation | Average Case | Worst Case | Notes |
|---|---|---|---|
d[key] = val — Insert / Update |
O(1) | O(n) | Worst case: all keys collide into one chain |
d[key] — Lookup |
O(1) | O(n) | Same collision scenario; extremely rare with good hash fn |
del d[key] — Delete |
O(1) | O(n) | Must find and remove entry from chain / slot |
key in d — Membership |
O(1) | O(n) | Same code path as lookup |
Iteration for k in d |
O(n) | O(n) | Must visit every populated bucket |
| Rehash (resize) | O(n) one-time | O(n) | Amortises to O(1) per insert across all ops |
| Space | O(n) | O(n) | Backing array is often 1.5–2× number of items |
Classic Hash Map Patterns
target − x is already in the map. Reduces classic O(n²) brute-force to O(n) in one pass.set. Membership checks become O(1) instead of O(n). Remove all duplicates in one pass.# ── Pattern 1: Two Sum — O(n) ─────────────────────────────
def two_sum(nums: list, target: int) -> tuple:
seen = {} # key=number, val=index
for i, x in enumerate(nums):
complement = target - x # what do we still need?
if complement in seen: # O(1) check
return (seen[complement], i) # found the pair!
seen[x] = i # store for future iterations
return (-1, -1)
# ── Pattern 2: Group Anagrams — O(n · k · log k) ─────────
from collections import defaultdict
def group_anagrams(words):
groups = defaultdict(list)
for w in words:
key = tuple(sorted(w)) # canonical anagram key
groups[key].append(w)
return list(groups.values())
# ── Pattern 3: Longest substring without repeating ────────
def length_of_longest_substring(s: str) -> int:
last_seen = {}
best = left = 0
for right, ch in enumerate(s):
if ch in last_seen and last_seen[ch] >= left:
left = last_seen[ch] + 1 # shrink window past duplicate
last_seen[ch] = right
best = max(best, right - left + 1)
return best
print(two_sum([2, 7, 11, 15], 9))
print(group_anagrams(["eat", "tea", "tan", "ate", "nat", "bat"]))
print(length_of_longest_substring("abcabcbb"))
Any time you find yourself writing a nested loop to check "does this value exist in the outer list?", stop — that's O(n²). Store the outer list in a hash map first, then the inner check becomes O(1). This single refactor pattern solves dozens of LeetCode problems.
Hash Map in Real Algorithms
# ── Subarray sum equals k — O(n) using prefix-sum map ────
def subarray_sum(nums, k):
count = 0
prefix = 0
seen = defaultdict(int)
seen[0] = 1 # empty prefix
for x in nums:
prefix += x
count += seen[prefix - k] # how many prior prefixes differ by k?
seen[prefix] += 1
return count
# ── LRU Cache — O(1) get & put — dict + doubly-linked list
from collections import OrderedDict
class LRUCache:
def __init__(self, capacity):
self.cap = capacity
self.cache = OrderedDict() # preserves insertion order
def get(self, key):
if key not in self.cache: return -1
self.cache.move_to_end(key) # mark as recently used
return self.cache[key]
def put(self, key, val):
if key in self.cache:
self.cache.move_to_end(key)
self.cache[key] = val
if len(self.cache) > self.cap:
self.cache.popitem(last=False) # evict LRU (front)
# ── isomorphic strings — O(n) with two maps ──────────────
def is_isomorphic(s, t):
s2t, t2s = {}, {}
for a, b in zip(s, t):
if s2t.get(a, b) != b or t2s.get(b, a) != a: return False
s2t[a] = b; t2s[b] = a
return True
print(subarray_sum([1,1,1], 2))
lru = LRUCache(2)
lru.put(1,1); lru.put(2,2); lru.put(3,3)
print(lru.get(1), lru.get(2))
print(is_isomorphic("egg", "add"), is_isomorphic("foo", "bar"))
The pattern seen[prefix − k] appears in subarray-sum, count of paths in trees, and range query problems. Every time you need to count "how many earlier positions satisfy a condition", prefix sums stored in a hash map are the tool.
Golden Rules
TypeError as keys.
Use tuple(lst) or frozenset(s) to create a hashable equivalent.
defaultdict or Counter before writing
any if key not in d guard. They eliminate an entire category
of boilerplate, run at the same speed, and make intent explicit.
Counter(iterable) builds a full frequency map in one line.
set or dict
first. The inner check drops from O(n) to O(1), making the whole algorithm O(n).
This refactor alone solves the majority of array-based interview problems.
PYTHONHASHSEED),
making deliberate collision attacks computationally infeasible.
dict a valid replacement for OrderedDict in most
new code. Use OrderedDict only when you need move_to_end()
or popitem(last=False) — as in LRU Cache implementations.