The Story That Explains Data Structures
Now imagine a second library: books are organised by genre, then author, then title. There is a card index at the front. A map on the wall. A numbering system on every shelf. You find any book in under a minute. This is what data structures do for a computer — they give raw data a shape, an address, and a purpose.
Choosing the wrong data structure is like choosing the wrong library system. A computer scientists' most important skill is knowing which structure to pick — and why.
A data structure is a way of organising, storing, and managing data in a computer so that it can be accessed and modified efficiently. Every program you write uses data structures — whether you know it or not. Lists, dictionaries, queues, trees — these are all data structures, each with its own strengths, weaknesses, and ideal use cases.
The difference between a program that runs in 0.001 seconds and one that takes 10 hours on the same machine is almost always the data structure chosen. Algorithms and data structures are two sides of the same coin — a great algorithm with a bad data structure is still slow. Knowing both is what separates junior developers from senior engineers.
The Big Picture — Types of Data Structures
Data structures are broadly split into two families: Linear (elements arranged in a sequence) and Non-Linear (elements connected in hierarchies or networks).
Each family trades memory for speed in different ways. The right choice depends entirely on your most frequent operation.
Arrays — The Foundation of Everything
An array is a contiguous block of memory holding elements of the same type.
Every element occupies the same number of bytes. Given a base address and an index,
the CPU can compute any element's exact location in one arithmetic operation:
address = base + index × element_size. This is why array access is always O(1).
Highlighted cell [3] is accessed instantly — the CPU computes its exact memory address without scanning.
Array Complexity Summary
| Operation | Best | Average | Worst | Notes |
|---|---|---|---|---|
| Access by index | O(1) | O(1) | O(1) | Direct address computation |
| Search (unsorted) | O(1) | O(n) | O(n) | Linear scan needed |
| Search (sorted) | O(1) | O(log n) | O(log n) | Binary search applies |
| Insert at end | O(1) | O(1)* | O(n) | *Amortised (dynamic arrays) |
| Insert at middle | O(n) | O(n) | O(n) | Must shift elements right |
| Delete from end | O(1) | O(1) | O(1) | Just decrease size |
| Delete from middle | O(n) | O(n) | O(n) | Must shift elements left |
Python: Arrays in Action
# Python lists are dynamic arrays under the hood
arr = [42, 17, 88, 56, 23, 99, 11]
# O(1) — direct index access
print(arr[3]) # → 56
print(arr[-1]) # → 11 (last element)
# O(1) amortised — append to end
arr.append(77)
# O(n) — insert in the middle (shifts elements)
arr.insert(2, 100) # insert 100 at index 2
# O(n) — delete from middle
arr.pop(2) # remove element at index 2
# O(log n) — binary search on sorted array
import bisect
sorted_arr = sorted(arr)
idx = bisect.bisect_left(sorted_arr, 56)
print(f"56 found at index {idx}")
# Memory-efficient numeric arrays (numpy)
import numpy as np
np_arr = np.array([1, 2, 3, 4, 5], dtype=np.int32)
print(np_arr * 2) # → [2, 4, 6, 8, 10] — vectorised O(n)
Linked Lists — Flexible but Scattered
That is a linked list. Each element (a node) holds two things: the actual data, and a pointer to the next node. You can insert a new clue anywhere in the hunt without rewriting all the others — just change two pointers. But to find clue 5, you always start at clue 1 and count forward.
Insertion after a known node requires only two pointer changes — O(1). But finding that node first requires O(n) traversal.
Types of Linked Lists
deque is built on this.# Building a Singly Linked List from scratch in Python
class Node:
def __init__(self, data):
self.data = data
self.next = None
class LinkedList:
def __init__(self):
self.head = None
def append(self, data): # O(n) — must walk to tail
node = Node(data)
if not self.head:
self.head = node; return
cur = self.head
while cur.next:
cur = cur.next
cur.next = node
def prepend(self, data): # O(1) — insert at head
node = Node(data)
node.next = self.head
self.head = node
def delete(self, data): # O(n) — search then remove
cur = self.head
if cur and cur.data == data:
self.head = cur.next; return
while cur.next:
if cur.next.data == data:
cur.next = cur.next.next; return
cur = cur.next
def __repr__(self):
vals, cur = [], self.head
while cur:
vals.append(str(cur.data))
cur = cur.next
return ' → '.join(vals) + ' → NULL'
ll = LinkedList()
for v in [42, 17, 88, 56]:
ll.append(v)
ll.prepend(99)
ll.delete(17)
print(ll)
Stacks — Last In, First Out
Your computer uses a stack constantly: every time you call a function, a "stack frame" (containing local variables and return address) is pushed onto the call stack. When the function returns, its frame is popped. When you get a "stack overflow" error, you have pushed too many frames — infinite recursion is the usual culprit.
# Stack using Python list (list.append = push, list.pop = pop)
stack = []
stack.append('A') # push A
stack.append('B') # push B
stack.append('C') # push C
print("Top:", stack[-1]) # peek → C
print("Pop:", stack.pop()) # pop → C
print("Stack:", stack) # ['A', 'B']
# Classic interview problem: balanced parentheses
def is_balanced(s):
stack, pairs = [], {')':'(', ']':'[', '}':'{'}
for ch in s:
if ch in '([{':
stack.append(ch)
elif ch in pairs:
if not stack or stack[-1] != pairs[ch]:
return False
stack.pop()
return not stack
print(is_balanced("({[]})")) # True
print(is_balanced("([)]")) # False
Queues — First In, First Out
Operating systems use queues everywhere: print jobs, CPU task scheduling, network packet buffers, message brokers (Kafka, RabbitMQ), and BFS graph traversal all rely on queue logic. Fairness is built-in — no element jumps the line.
collections.deque gives O(1) at both ends.collections.deque implements this using a doubly linked list.from collections import deque
import heapq
# ── Simple Queue ──────────────────────────────────────
q = deque()
q.append('Task A') # enqueue → O(1)
q.append('Task B')
q.append('Task C')
print(q.popleft()) # dequeue → 'Task A' O(1)
# ── Priority Queue (Min-Heap) ─────────────────────────
pq = []
heapq.heappush(pq, (3, 'Low priority'))
heapq.heappush(pq, (1, 'Critical task'))
heapq.heappush(pq, (2, 'Normal task'))
while pq:
priority, task = heapq.heappop(pq) # always pops minimum
print(f" Priority {priority}: {task}")
# ── Deque (sliding window max) ─────────────────────────
def sliding_max(nums, k):
dq, result = deque(), []
for i, n in enumerate(nums):
while dq and nums[dq[-1]] <= n:
dq.pop()
dq.append(i)
if dq[0] == i - k:
dq.popleft()
if i >= k - 1:
result.append(nums[dq[0]])
return result
print(sliding_max([1,3,-1,-3,5,3,6,7], k=3))
Hash Tables — The O(1) Lookup Machine
Two guests could theoretically be assigned the same room — this is called a collision. The hotel handles it by chaining a second key inside the same room (separate chaining) or finding the next empty room (open addressing). A good hash function makes collisions rare. This is exactly how Python's
dict works.
"date" (highlighted) maps to bucket [4]. Load factor (items÷buckets) should stay below 0.7 for good performance.
# Python dict IS a hash table — O(1) average for all operations
prices = {
'apple': 1.20,
'banana': 0.50,
'cherry': 3.50,
'date': 0.80
}
# O(1) — lookup, insert, delete
print(prices['cherry']) # 3.5
prices['elderberry'] = 5.00 # insert
del prices['date'] # delete
print('apple' in prices) # True → O(1)
# Frequency counter — classic hash table pattern
words = ['cat', 'dog', 'cat', 'bird', 'dog', 'cat']
freq = {}
for w in words:
freq[w] = freq.get(w, 0) + 1
print(freq) # {'cat': 3, 'dog': 2, 'bird': 1}
# Two-sum problem — O(n) using hash table
def two_sum(nums, target):
seen = {}
for i, n in enumerate(nums):
complement = target - n
if complement in seen:
return [seen[complement], i]
seen[n] = i
return []
print(two_sum([2, 7, 11, 15], 9)) # [0, 1]
Trees — Hierarchical Power
A tree is a hierarchical data structure with a root node and subtrees of children. Every node except the root has exactly one parent. Trees without cycles are everywhere: filesystems, HTML DOM, JSON, compiler ASTs, and databases all use trees internally.
BST search: at each node, go left if target < node, right if target > node. Finds 60 in 3 comparisons instead of 7.
Tree Traversal Methods
class BST:
class Node:
def __init__(self, val):
self.val, self.left, self.right = val, None, None
def __init__(self):
self.root = None
def insert(self, val): # O(log n) avg, O(n) worst (skewed)
def _ins(node, v):
if not node: return BST.Node(v)
if v < node.val: node.left = _ins(node.left, v)
else: node.right = _ins(node.right, v)
return node
self.root = _ins(self.root, val)
def inorder(self): # O(n) — sorted output
result = []
def _walk(node):
if node:
_walk(node.left)
result.append(node.val)
_walk(node.right)
_walk(self.root); return result
def search(self, val): # O(log n) average
node = self.root
while node:
if val == node.val: return True
elif val < node.val: node = node.left
else: node = node.right
return False
bst = BST()
for v in [50, 30, 70, 20, 40, 60, 80]:
bst.insert(v)
print("In-order:", bst.inorder())
print("Search 60:", bst.search(60))
print("Search 55:", bst.search(55))
Heaps — The Priority Expert
A heap is a complete binary tree that satisfies the heap property: in a min-heap, every parent is smaller than its children. In a max-heap, every parent is larger. The root is always the smallest (or largest) element, making it perfect for priority queues and sorting.
A heap is a complete binary tree stored as a flat array. No pointers needed — parent/child relationships are computed by index arithmetic.
import heapq
# Python heapq is a MIN-heap by default
nums = [7, 3, 1, 5, 4, 2, 6]
heapq.heapify(nums) # O(n) — convert list to heap in-place
print("Heap:", nums)
# Always pops the minimum
print("Pop min:", heapq.heappop(nums)) # 1 → O(log n)
print("Pop min:", heapq.heappop(nums)) # 2 → O(log n)
# Find k largest elements — classic heap use
data = [10, 4, 20, 1, 7, 15, 3]
print("3 largest:", heapq.nlargest(3, data)) # [20, 15, 10]
print("3 smallest:", heapq.nsmallest(3, data)) # [1, 3, 4]
# Max-heap trick: negate the values
max_heap = []
for n in data:
heapq.heappush(max_heap, -n)
print("Max:", -heapq.heappop(max_heap)) # 20
Graphs — The Most Powerful Structure
BFS (Breadth-First Search) finds the shortest path in an unweighted graph — like finding the fewest hops between two Facebook friends. Dijkstra's algorithm finds the shortest path in a weighted graph — like the fastest route on Google Maps.
from collections import deque
# Adjacency list representation (most common)
graph = {
'A': ['B', 'C'],
'B': ['D'],
'C': ['D'],
'D': ['E', 'F'],
'E': ['F'],
'F': []
}
# BFS — shortest path (unweighted)
def bfs(graph, start, goal):
queue = deque([[start]])
visited = {start}
while queue:
path = queue.popleft()
node = path[-1]
if node == goal: return path
for neighbour in graph[node]:
if neighbour not in visited:
visited.add(neighbour)
queue.append(path + [neighbour])
return None
# DFS — explores depth first
def dfs(graph, node, visited=None):
if visited is None: visited = set()
visited.add(node)
result = [node]
for n in graph[node]:
if n not in visited:
result += dfs(graph, n, visited)
return result
print("BFS A→F:", bfs(graph, 'A', 'F'))
print("DFS from A:", dfs(graph, 'A'))
Big-O Complexity — Choosing Wisely
Every data structure has a complexity profile. This is the single most important table in this tutorial — memorise it, and you will make the right architecture decisions in every interview and production system.
| Data Structure | Access | Search | Insert | Delete | Space | Best Use Case |
|---|---|---|---|---|---|---|
| Array | O(1) | O(n) | O(n) | O(n) | O(n) | Index-heavy access, numeric computation |
| Linked List | O(n) | O(n) | O(1)* | O(1)* | O(n) | Frequent insertion/deletion, queues, undo stacks |
| Stack | O(n) | O(n) | O(1) | O(1) | O(n) | LIFO ordering, call stacks, expression parsing |
| Queue | O(n) | O(n) | O(1) | O(1) | O(n) | FIFO scheduling, BFS, message buffers |
| Hash Table | N/A | O(1)** | O(1)** | O(1)** | O(n) | Fast lookup by key, caching, deduplication |
| BST (balanced) | O(n) | O(log n) | O(log n) | O(log n) | O(n) | Sorted data, range queries, databases |
| Heap | O(1)† | O(n) | O(log n) | O(log n) | O(n) | Priority queues, top-k, Dijkstra's |
| Graph (adj list) | N/A | O(V+E) | O(1) | O(E) | O(V+E) | Relationships, networks, routing, social graphs |
* with known node reference · ** average case · † min or max only
Ask three questions: (1) What operation is most frequent? — if it's lookup by key, use a hash table. If it's finding minimum/maximum repeatedly, use a heap. (2) Is the data ordered? — if yes and you need range queries, use a BST. (3) Is memory tight? — arrays are the most memory-efficient per element; linked structures add pointer overhead.
Structure Comparison — When to Use What
| Scenario | Wrong Choice | Right Choice | Why |
|---|---|---|---|
| Check if username exists in a database of 10M users | Array / List | Hash Set | O(1) lookup vs O(n) scan |
| Undo the last 20 actions in a text editor | Queue | Stack | LIFO — last action undone first |
| Find the shortest path between cities | Array | Graph + BFS/Dijkstra | Relationships naturally form a graph |
| Process customer support tickets in arrival order | Stack | Queue | FIFO — first arrived, first served |
| Always serve the highest-urgency hospital patient next | Queue | Priority Queue (Heap) | Need "most urgent first", not "oldest first" |
| Find all employees earning between £40k–£60k | Hash Table | BST / Sorted Array | Range queries need ordered structure |
| Frequent insertion/removal in the middle of a list | Array | Doubly Linked List | Array shifts are O(n); linked insert is O(1) |
| Store 1 million integers and access by index constantly | Linked List | Array (numpy) | Cache-friendly contiguous memory, O(1) access |
Advanced: Tries — The String Search Expert
A Trie (prefix tree) is a tree specialised for strings. Each node represents a single character. Words are paths from root to a marked node. If 1,000 words share the prefix "pre", that prefix is stored only once — not 1,000 times.
Prefix "ca" is stored once and shared by "car", "card", "care", "cat". Trie lookup is O(m) where m = length of the word — independent of how many words are stored.
Autocomplete in search engines and IDEs, spell checkers, IP routing tables (CIDR prefix matching), phone contact search, and word games like Scrabble all use Trie structures. The reason: no other structure gives O(m) prefix search with shared storage as elegantly.
# Trie implementation in Python
class TrieNode:
def __init__(self):
self.children = {} # char → TrieNode
self.is_end = False # marks complete word
class Trie:
def __init__(self):
self.root = TrieNode()
def insert(self, word): # O(m) where m = len(word)
node = self.root
for ch in word:
node = node.children.setdefault(ch, TrieNode())
node.is_end = True
def search(self, word): # O(m) — exact match
node = self.root
for ch in word:
if ch not in node.children: return False
node = node.children[ch]
return node.is_end
def starts_with(self, prefix): # O(p) — autocomplete check
node = self.root
for ch in prefix:
if ch not in node.children: return False
node = node.children[ch]
return True
trie = Trie()
for w in ['car', 'card', 'care', 'cat', 'cut']:
trie.insert(w)
print(trie.search('card')) # True
print(trie.search('car')) # True
print(trie.search('cars')) # False (not inserted)
print(trie.starts_with('ca')) # True (prefix check)
print(trie.starts_with('cu')) # True
print(trie.starts_with('cv')) # False
Golden Rules
dict and set are hash tables.
Use them relentlessly for membership testing (x in s), deduplication,
and frequency counting. Converting a list to a set before searching is one of the most
common O(n) → O(1) optimisations in Python.
collections.deque instead of list as a queue.
list.pop(0) is O(n) because it shifts all remaining elements.
deque.popleft() is O(1). This single swap can reduce BFS from O(n²) to O(n).
sortedcontainers.SortedList) for production work
that requires ordered access.