Data Structure 📂 Data Structures (Core) · 5 of 6 36 min read

Hash Tables & Hash Maps — Complete Python Tutorial

A comprehensive, animated tutorial covering hash tables from first principles (cinema seat analogy), collision handling via separate chaining, Python's dict/defaultdict/Counter, six canonical hash map patterns (Two-Sum, anagram grouping, memoisation, deduplication, sliding window), collections.deque vs plain list for O(1) double-ended ops, and heapq min-heap (push, pop, k-th largest, merge-k-sorted).

Section 01

The Story That Explains Hash Tables

Cinema Seat Numbers — The Ultimate O(1) Lookup
Imagine a cinema with 1,000 seats. A strict usher checks every seat one by one to find where "Alice" is sitting — that's O(n) linear search. A chaotic usher shouts "Alice!" and waits for a reply — also O(n) in the worst case.

A smart cinema pre-computes each person's seat from their ticket number. Your ticket number is 7421? Seat = 7421 % 100 = 21. The usher walks directly to row 21 — no scanning, no shouting.

That computation (ticket % capacity) is the hash function. The numbered rows are the buckets. The whole cinema is a Hash Table — built for instant, direct access.

A Hash Table (Hash Map) maps keys to values using a hash function to compute an array index directly. Average-case insert, delete, and lookup are all O(1) — constant time regardless of how many items are stored. Python's dict and set are hash tables under the hood.

💡
Why Hash Tables Rule Data Structures

Arrays give O(1) access by position. Hash tables give O(1) access by any key — a string, an integer, a tuple. They're the go-to structure any time you need to count, group, cache, or deduplicate data fast.


Section 02

How a Hash Table Works — The 4 Phases

01
Hash the Key
Feed the key into a hash function (Python's built-in hash()). This converts any key — string, int, tuple — into a large integer. Same key always produces the same hash. Different keys usually produce different hashes.
02
Compress to Index
Apply hash_value % capacity to get a valid array index (0 … capacity−1). This is the bucket number where the key-value pair will live.
03
Store / Retrieve
Write the (key, value) pair into that bucket. On lookup, hash the key again → same index → direct access. No scanning needed.
04
Handle Collisions
Two keys can land on the same bucket. The two standard strategies are Separate Chaining (a linked list per bucket) and Open Addressing (probe for the next free slot). Python uses open addressing with pseudo-random probing.
🎬 Interactive — Insert keys one by one (capacity = 8, separate chaining)
Click Insert Next to begin inserting keys.
⚠️
Collision — Two Keys, One Bucket

When two different keys produce the same index, that's a collision. The chaining strategy stores both as a list in that bucket and scans it on lookup. Python's dict avoids this entirely for most workloads using a high-quality hash function and rehashing at ~⅔ load factor.


Section 03

Collision Strategies Side-by-Side

Strategy A — Separate Chaining
Each bucketholds a linked list
Collisionappend to list — O(1)
Lookupscan list — O(L) where L = chain length
Memoryoverhead per node (pointer + data)
Used byJava HashMap, most textbook impls
Strategy B — Open Addressing
Each bucketholds at most one entry
Collisionprobe next slot — O(1) avg
Lookupprobe sequence — O(1) avg
Memorycompact flat array, no pointers
Used byPython dict, C++ unordered_map
🎬 Open Addressing — Linear Probe (capacity = 7)
Click Insert Next to probe for a free slot.

Section 04

Load Factor & Rehashing

Load factor α = n / capacity — the fraction of buckets in use. As α grows, collision probability rises and average lookup time degrades. Python's dict rehashes (doubles capacity and reinserts all entries) when α exceeds roughly ⅔ (0.67).

📐 What Happens During a Rehash
Trigger
n / capacity exceeds threshold (≈ 0.67 for Python dict)
Allocate
Create a new backing array of ~2× the old capacity
Reinsert
Recompute hash(key) % new_capacity for every existing entry — O(n) one-time cost
Amortise
Across all insertions the amortised cost is still O(1) per insert
🎬 Load Factor Meter — watch rehash trigger at ⅔
Buckets (capacity = 8)
Load factor: 0.00  |  Items: 0  /  Capacity: 8
Insert items to watch the load factor rise.
🔑
Pre-size Your Dict to Avoid Rehash Spikes

In latency-sensitive code, if you know you'll insert N items, pre-populate with d = dict.fromkeys(known_keys) or insert a placeholder batch upfront. This sidesteps the rehash spike during a hot path, since Python won't resize until the ⅔ threshold is hit again on the already-enlarged array.


Section 05

Build a Hash Map from Scratch

We implement a minimal hash map using separate chaining. Each bucket holds a list of (key, value) pairs. Step through the animation to watch each line execute during put("name","Alice").

🎬 Code Trace — HashMap.put("name", "Alice")
Click ▶ Step to begin.
OUTPUT — full test run
HashMap state after all puts: bucket[0]: [] bucket[1]: [('name', 'Alice')] bucket[2]: [] bucket[3]: [('city', 'Delhi')] bucket[4]: [('age', 25)] bucket[5]: [] bucket[6]: [('lang', 'Python')] bucket[7]: [] get('name') → 'Alice' get('city') → 'Delhi' delete('age') → True get('age') → None (KeyError-safe)
Why Each Bucket Holds a List

The list inside each bucket is the chain. When two keys hash to the same index, both (key, value) pairs coexist in that bucket's list. On lookup we scan only that short list — usually length 0 or 1 — which is effectively O(1) in practice.


Section 06

Python's dict, defaultdict & Counter

# ── 1. Core dict operations ───────────────────────────────
d = {}                                   # empty hash map — O(1)
d["name"] = "Alice"                     # insert / update — O(1) avg
d["age"]  = 30
print(d.get("name", "N/A"))             # safe lookup — no KeyError
print("age" in d)                       # membership — O(1)
del d["age"]                             # delete — O(1) avg
d.setdefault("score", 0)               # insert only if missing

# ── 2. defaultdict — auto-creates missing keys ────────────
from collections import defaultdict
groups = defaultdict(list)
words = ["apple", "ant", "banana", "bear", "cat"]
for w in words:
    groups[w[0]].append(w)              # group by first letter — no KeyError
print(dict(groups))

# ── 3. Counter — frequency hash map ──────────────────────
from collections import Counter
text = "hello world"
freq = Counter(text)                   # O(n) build
print(freq.most_common(3))             # top 3 items — O(n log k)
print(freq["l"])                        # O(1) count lookup

# ── Counter arithmetic ────────────────────────────────────
a = Counter(["a","b","a"])
b = Counter(["a","c"])
print(a + b)   # combine counts
print(a - b)   # subtract counts (drops zeros/negatives)
OUTPUT
Alice True {'a': ['apple', 'ant'], 'b': ['banana', 'bear'], 'c': ['cat']} [('l', 3), ('o', 2), ('h', 1)] 3 Counter({'a': 3, 'b': 1, 'c': 1}) Counter({'b': 1, 'a': 1})
📦
dict.get() vs dict[key] — Always Prefer .get() for Safe Access

d[key] raises KeyError if missing. d.get(key, default) returns the default instead — zero overhead, zero crashes. In interview code and production code alike, .get() is the professional choice.


Section 07

Time & Space Complexity Reference

Operation Average Case Worst Case Notes
d[key] = val — Insert / Update O(1) O(n) Worst case: all keys collide into one chain
d[key] — Lookup O(1) O(n) Same collision scenario; extremely rare with good hash fn
del d[key] — Delete O(1) O(n) Must find and remove entry from chain / slot
key in d — Membership O(1) O(n) Same code path as lookup
Iteration for k in d O(n) O(n) Must visit every populated bucket
Rehash (resize) O(n) one-time O(n) Amortises to O(1) per insert across all ops
Space O(n) O(n) Backing array is often 1.5–2× number of items

Section 08

Classic Hash Map Patterns

🔢
Frequency Counter
Counter / defaultdict(int)
Count occurrences of any item. Replaces O(n²) nested loops with a single O(n) pass. Foundation of most string/array problems.
🎯
Two-Sum / Complement
target − x lookup
For each x, check if target − x is already in the map. Reduces classic O(n²) brute-force to O(n) in one pass.
💾
Memoisation / Cache
functools.lru_cache
Store expensive results keyed by input. Turns exponential recursion into linear. The hash map is the entire engine of dynamic programming.
🗂️
Group By / Anagrams
sorted(word) as key
Group strings by a canonical key (e.g. sorted chars). All anagrams land in the same bucket. O(n·k·log k), no nested loops.
Deduplication
set() or dict keys
Convert a list to a set. Membership checks become O(1) instead of O(n). Remove all duplicates in one pass.
🌊
Sliding Window
char frequency maps
Maintain a frequency dict of the current window. Expand / contract in O(1) updates. Powers substring and subarray problems.
# ── Pattern 1: Two Sum — O(n) ─────────────────────────────
def two_sum(nums: list, target: int) -> tuple:
    seen = {}                            # key=number, val=index
    for i, x in enumerate(nums):
        complement = target - x         # what do we still need?
        if complement in seen:           # O(1) check
            return (seen[complement], i)  # found the pair!
        seen[x] = i                      # store for future iterations
    return (-1, -1)

# ── Pattern 2: Group Anagrams — O(n · k · log k) ─────────
from collections import defaultdict
def group_anagrams(words):
    groups = defaultdict(list)
    for w in words:
        key = tuple(sorted(w))            # canonical anagram key
        groups[key].append(w)
    return list(groups.values())

# ── Pattern 3: Longest substring without repeating ────────
def length_of_longest_substring(s: str) -> int:
    last_seen = {}
    best = left = 0
    for right, ch in enumerate(s):
        if ch in last_seen and last_seen[ch] >= left:
            left = last_seen[ch] + 1     # shrink window past duplicate
        last_seen[ch] = right
        best = max(best, right - left + 1)
    return best

print(two_sum([2, 7, 11, 15], 9))
print(group_anagrams(["eat", "tea", "tan", "ate", "nat", "bat"]))
print(length_of_longest_substring("abcabcbb"))
OUTPUT
(0, 1) ← 2 + 7 = 9 [['eat', 'tea', 'ate'], ['tan', 'nat'], ['bat']] 3 ← "abc"
🎬 Two-Sum Trace — nums=[2,7,11,15], target=9
seen { }
{}
current step
Click ▶ Step to begin.
🚀
The Hash Map Refactor — From O(n²) to O(n)

Any time you find yourself writing a nested loop to check "does this value exist in the outer list?", stop — that's O(n²). Store the outer list in a hash map first, then the inner check becomes O(1). This single refactor pattern solves dozens of LeetCode problems.


Section 09

Hash Map in Real Algorithms

# ── Subarray sum equals k — O(n) using prefix-sum map ────
def subarray_sum(nums, k):
    count     = 0
    prefix    = 0
    seen      = defaultdict(int)
    seen[0]   = 1                       # empty prefix
    for x in nums:
        prefix += x
        count  += seen[prefix - k]     # how many prior prefixes differ by k?
        seen[prefix] += 1
    return count

# ── LRU Cache — O(1) get & put — dict + doubly-linked list
from collections import OrderedDict
class LRUCache:
    def __init__(self, capacity):
        self.cap   = capacity
        self.cache = OrderedDict()       # preserves insertion order

    def get(self, key):
        if key not in self.cache: return -1
        self.cache.move_to_end(key)     # mark as recently used
        return self.cache[key]

    def put(self, key, val):
        if key in self.cache:
            self.cache.move_to_end(key)
        self.cache[key] = val
        if len(self.cache) > self.cap:
            self.cache.popitem(last=False) # evict LRU (front)

# ── isomorphic strings — O(n) with two maps ──────────────
def is_isomorphic(s, t):
    s2t, t2s = {}, {}
    for a, b in zip(s, t):
        if s2t.get(a, b) != b or t2s.get(b, a) != a: return False
        s2t[a] = b; t2s[b] = a
    return True

print(subarray_sum([1,1,1], 2))
lru = LRUCache(2)
lru.put(1,1); lru.put(2,2); lru.put(3,3)
print(lru.get(1), lru.get(2))
print(is_isomorphic("egg", "add"), is_isomorphic("foo", "bar"))
OUTPUT
2 ← 2 subarrays sum to k=2: [1,1] at 0..1 and 1..2 -1 2 ← key 1 was evicted (LRU), key 2 still alive True False
🧠
Prefix Sum + Hash Map — A Lethal Combination

The pattern seen[prefix − k] appears in subarray-sum, count of paths in trees, and range query problems. Every time you need to count "how many earlier positions satisfy a condition", prefix sums stored in a hash map are the tool.


Section 10

Golden Rules

🗝️ Hash Tables / Hash Maps — Non-Negotiable Rules
1
Only hashable types can be dict keys or set elements. Integers, strings, and tuples of hashables are fine. Lists, dicts, and sets are mutable and not hashable — they raise TypeError as keys. Use tuple(lst) or frozenset(s) to create a hashable equivalent.
2
Reach for defaultdict or Counter before writing any if key not in d guard. They eliminate an entire category of boilerplate, run at the same speed, and make intent explicit. Counter(iterable) builds a full frequency map in one line.
3
Any nested-loop pattern that asks "does X exist in this list?" is an O(n²) smell. Convert the outer list to a set or dict first. The inner check drops from O(n) to O(1), making the whole algorithm O(n). This refactor alone solves the majority of array-based interview problems.
4
The worst case O(n) exists but is irrelevant in practice. It requires an adversary crafting keys that all hash to the same bucket. Python randomises hash seeds at interpreter start (PYTHONHASHSEED), making deliberate collision attacks computationally infeasible.
5
Load factor is the silent performance killer. Python resizes at ⅔ capacity, keeping average chain length near 1. In your own hash map implementations, always resize before exceeding 0.7 load factor — beyond that, collision rates spike non-linearly.
6
Python dicts preserve insertion order since Python 3.7 — rely on it. This makes dict a valid replacement for OrderedDict in most new code. Use OrderedDict only when you need move_to_end() or popitem(last=False) — as in LRU Cache implementations.