
Amazon Architecture Explained: AWS, Microservices, DynamoDB & Event-Driven Design

A deep-dive into how Amazon evolved from a single-server bookshop monolith into the world's most sophisticated distributed platform — covering its service-oriented architecture, AWS core services (EC2, S3, Lambda, DynamoDB, SQS/SNS), event-driven design, caching strategy, auto scaling, observability, the fulfilment network, and the architectural patterns (CQRS, Saga, Circuit Breaker, Strangler Fig) that power one of the most reliable and scalable systems ever built.

Section 01

The Story That Explains Amazon's Architecture

From a Single Garage to a City of a Million Warehouses
Imagine you open a bookshop in your garage. You do everything yourself — you take orders, pack boxes, handle returns, manage the accounts, and deliver on weekends. Business is good, so you hire a few people. Then a few more. One day you wake up and you have 10,000 employees, warehouses on five continents, and tens of millions of customers browsing simultaneously at 3 a.m.

The systems that worked in your garage — one spreadsheet, one phone, one shelf — collapse instantly at this scale. You need a completely different way of thinking about infrastructure, teams, and services.

That is exactly the journey Amazon made from 1994 to today. And the architectural decisions they were forced to make along the way became the blueprint that every large tech company now copies.

Amazon did not start with a clean-room architectural vision. They started with a monolith — one giant application where everything was tangled together. Every feature, every team, and every database was woven into a single codebase. Deploying a change to the homepage required the entire company to hold its breath.

⚠️
The Monolith Problem at Scale

By the early 2000s, Amazon's monolith had grown so large that a bug in the Wish List feature could take down the Checkout service. A slow database query in Reviews could delay order fulfilment. Every team was blocked by every other team. The system was a ticking time bomb — and Black Friday was approaching.


Section 02

Amazon's Architecture — The 30,000-Foot View

Amazon's architecture is best understood as three concentric rings, each solving a different problem at a different layer.

🏗️ Amazon Three-Ring Architecture (diagram)
Rings, innermost to outermost: Core Domain Services, Platform Microservices, AWS Infrastructure
AWS infrastructure shown: CloudFront CDN, S3 Storage, Lambda, SQS / SNS, IAM / Shield, RDS / DynamoDB
Services shown: Order Svc, Search Svc, Cart Svc, Payment Svc, User Svc

Each ring is independently scalable. A spike in Search traffic does not affect Payments. A DynamoDB update does not require redeploying the Cart service.


Section 03

The Two-Pizza Team Rule & Service-Oriented Architecture

"All teams will henceforth expose their data and functionality through service interfaces."
Around 2002, Jeff Bezos sent an internal memo that became legendary in software engineering circles. The memo had six points — the last of which simply said: "Anyone who doesn't do this will be fired."

The mandate was brutally simple: every team must expose its data through a service interface. No direct database calls from other teams. No shared memory. If you want another team's data, you call their API. Period.

This single memo restructured how Amazon built software — and accidentally invented the approach that would later power AWS, the most profitable cloud business in history.

Bezos also coined the Two-Pizza Rule: if a team cannot be fed with two pizzas, it is too large. Small teams own small services. Small services can be deployed, scaled, and killed independently.

🍕
Two-Pizza Teams
5–8 Engineers Max
Each team owns exactly one service end-to-end: design, build, deploy, monitor, and on-call. No committee approvals. No cross-team deployment gates. You build it, you run it.
🔌
API Contracts
The Only Interface
Teams communicate exclusively through well-defined API contracts. The implementation behind the API can change at any time, as long as the contract holds. Complete internal freedom, enforced external stability.
🧱
Service Isolation
Blast Radius Containment
When a service fails, only that service fails. The surrounding services degrade gracefully or fall back. The failure does not cascade. This is the entire point of the architecture — resilience through isolation.
⚡ Monolith vs Service-Oriented Architecture (animated diagram)
❌ Monolith (2001): Cart + Checkout, Search + Catalog, User + Auth, Payments + Billing, Inventory + Shipping, and Reviews + Recs all sit on ONE shared database. One deploy = entire site risk.
✅ Service-Oriented (2004+): separate Cart, Search, User, Pay, Inventory, and Recs services connected by an event bus (SQS / SNS / EventBridge). Each service deploys independently.

The monolith (left) shakes as a whole when any single module is stressed. Each SOA node (right) floats independently — stress on Cart does not ripple to Payments.


Section 04

AWS — The Product That Came From Internal Pain

Amazon Accidentally Built the World's Biggest Cloud
Between 2000 and 2003, Amazon's engineering teams were drowning in undifferentiated heavy lifting. Every new service had to provision servers, configure networking, manage storage, and handle security from scratch. It was the same painful work every time.

They built internal tools to solve this. Then they standardised those tools into platforms. Then someone had an idea: what if we sell access to these platforms?

In 2006, Amazon launched S3 and EC2, and AWS as we know it was born. AWS now generates on the order of $100 billion in annual revenue and, in recent years, has contributed well over half of Amazon's total operating profit. The cloud business Amazon built to survive its own growing pains has become its main profit engine, earning far more per dollar of revenue than the retail business it was created to support.
📅
2006 — S3 & EC2
Cloud Storage + Compute
Amazon launches Simple Storage Service (S3) and Elastic Compute Cloud (EC2). The concept of renting elastic compute by the hour redefines the entire industry.
📅
2009 — RDS & VPC
Managed Databases + Networking
Relational Database Service removes the operational burden of running databases. Virtual Private Cloud lets enterprises isolate their AWS workloads like a private data centre.
📅
2014 — Lambda
Serverless Compute
AWS Lambda introduces the concept of functions-as-a-service. You write code. AWS handles every other aspect of its execution — servers, scaling, patching, billing per millisecond.

Section 05

Core AWS Services — What They Do & Why They Exist

🌐 A Request's Journey Through Amazon's Infrastructure
🌍
Step 1 — Route 53 (DNS)
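A customer types amazon.com. Route 53 receives this DNS query and resolves it to the nearest healthy endpoint using latency-based routing. If an entire AWS region goes down, Route 53 health checks detect the failure and DNS fails over to the next healthy region. How quickly this happens depends on the health-check interval and the record's TTL; it is typically on the order of a minute.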
🛡️
Step 2 — CloudFront + WAF (Edge)
The request hits Amazon's global edge network of 600+ Points of Presence spread across dozens of countries. CloudFront serves cached static assets (images, JS, CSS) from the edge node nearest to the user. AWS WAF inspects every request for SQL injection, XSS, and known attack patterns before anything reaches the origin.
⚖️
Step 3 — Application Load Balancer
Dynamic requests pass through an Application Load Balancer (ALB). The ALB routes by URL path (/search/* → Search service, /cart/* → Cart service, /checkout/* → Checkout service). It health-checks every backend target on a configurable interval (30 seconds by default) and stops routing to a target once it fails a set number of consecutive checks.
📦
Step 4 — ECS / EKS (Containers)
The request lands on a containerised microservice running in either Elastic Container Service (ECS) or Elastic Kubernetes Service (EKS). Each microservice runs in its own container, with its own memory, CPU, and deployment lifecycle. Service-level auto scaling (ECS Service Auto Scaling or the Kubernetes Horizontal Pod Autoscaler) adds containers when CPU exceeds 70% or queue depth rises, while EC2 Auto Scaling Groups add the underlying instances.
💾
Step 5 — ElastiCache + DynamoDB / RDS
The service first checks ElastiCache (Redis) for a cached result. Cache hit → return in <1 ms. Cache miss → query DynamoDB (NoSQL, single-digit-millisecond latency at any scale) or Amazon Aurora (relational, for transactional data). Amazon has reported DynamoDB peaking at more than 100 million requests per second during Prime Day.
🔁
Step 6 — SQS / SNS / EventBridge (Async)
After the synchronous response is returned to the user, asynchronous side-effects are published as events: "order placed" → SNS topic → one SQS queue each for the Inventory, Shipping, Notifications, and Analytics services. No tight coupling. Each downstream service processes at its own pace. If one is slow, its queue buffers the messages (up to the configured retention period) instead of dropping them.

Each step is handled by a separate, independently scalable AWS service. A bottleneck at Step 5 never affects Step 2.
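The path-based routing in Step 3 can be pictured with a few lines of Python. This is only a sketch: the `ROUTES` table, the service names, and the longest-prefix rule are illustrative assumptions (a real ALB evaluates listener rules by priority), and health is modelled as a simple set of target names.

```python
from typing import Optional

# Hypothetical routing table: URL path prefix -> target service name.
ROUTES = {
    "/search/": "search-service",
    "/cart/": "cart-service",
    "/checkout/": "checkout-service",
}


def route(path: str, healthy: set) -> Optional[str]:
    """Return the healthy target for the longest matching prefix, else None."""
    matches = [prefix for prefix in ROUTES if path.startswith(prefix)]
    if not matches:
        return None  # no rule matches this path
    target = ROUTES[max(matches, key=len)]
    # An unhealthy target is skipped, mirroring a load balancer that has
    # taken it out of rotation after failed health checks.
    return target if target in healthy else None
```

Calling `route("/cart/items", {"cart-service"})` returns `"cart-service"`, while the same call with an empty healthy set returns `None`.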


Section 06

DynamoDB — Amazon's Most Important Database Innovation

Amazon needed a database that could handle tens of millions of peak requests per second with single-digit millisecond latency — globally, without downtime. No existing database came close. So they built one.

Dynamo: Amazon's Highly Available Key-Value Store
Amazon's internal infrastructure team published a paper in 2007 describing the design of Dynamo, their internal key-value store. The paper introduced concepts — consistent hashing, vector clocks, sloppy quorums — that became the DNA of an entire generation of NoSQL databases including Cassandra, Riak, and eventually the public DynamoDB released in 2012.

The core insight: for shopping cart and session data, it is acceptable for different servers to have slightly different views of the data for a short time, as long as the system is always available and never blocks. Availability trumps consistency for most e-commerce workloads. This trade-off, choosing availability over strict consistency when the network partitions (the choice framed by the CAP theorem), was revolutionary.
🏗️ How DynamoDB Handles a Write at Planet Scale
Step 1
Partition Key Hashing: The partition key (e.g., userId#42) is hashed via a consistent hash function to determine which storage node owns this data.
Step 2
Leader Write: The write goes to the designated leader replica for that partition. The leader writes to its local storage and simultaneously replicates to two follower replicas across different AZs.
Step 3
Quorum ACK: Once 2 of 3 replicas confirm the write, success is returned to the client. This is quorum — tolerating one AZ failure without data loss or blocking.
Step 4
Global Tables: If Global Tables is enabled, the write is asynchronously replicated to all configured regions (e.g., us-east-1 → eu-west-1 → ap-southeast-1) within seconds. Active-active multi-region with last-writer-wins conflict resolution.
Result
End-to-end write latency: single-digit milliseconds. During Prime Day 2022, Amazon reported DynamoDB serving trillions of API calls and peaking at 105.2 million requests per second.
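The hashing and quorum steps above can be sketched in Python. This is a toy model in the spirit of the Dynamo paper, not DynamoDB's real implementation: the `HashRing` class, the use of MD5, the virtual-node count, and the node names are all assumptions made for illustration.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    # MD5 is used only to spread keys evenly; it is not a security choice.
    return int(hashlib.md5(key.encode()).hexdigest(), 16)


class HashRing:
    """Consistent-hash ring with virtual nodes (illustrative only)."""

    def __init__(self, nodes, vnodes=64):
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def replicas(self, key: str, n: int = 3):
        """Walk clockwise from the key's position, collecting n distinct nodes."""
        start = bisect.bisect(self._points, _hash(key)) % len(self._ring)
        found = []
        for offset in range(len(self._ring)):
            node = self._ring[(start + offset) % len(self._ring)][1]
            if node not in found:
                found.append(node)
                if len(found) == n:
                    break
        return found


def quorum_write(replicas, is_up, w=2):
    """Report success once at least w replicas acknowledge the write."""
    acks = sum(1 for replica in replicas if is_up(replica))
    return acks >= w
```

With three nodes, `HashRing(["az-a", "az-b", "az-c"]).replicas("userId#42")` always returns the same ordering for the same key, and `quorum_write` still succeeds when any single replica is down.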
| Feature | DynamoDB | MySQL / PostgreSQL | MongoDB |
|---|---|---|---|
| Scaling Model | Horizontal (automatic partitioning) | Vertical, plus read replicas | Horizontal (sharded clusters, shard key chosen by you) |
| Peak RPS | No fixed ceiling (scales with partitions) | Thousands (tuned) | Hundreds of thousands |
| Latency (p99) | < 5 ms | 10–100 ms | 5–50 ms |
| Operational Overhead | Very low (fully managed) | High (patches, vacuums, failover) | Medium |
| ACID Transactions | Yes (up to 100 items per transaction) | Full ACID | Single-document by default; multi-document transactions since 4.0 |
| Best For | Session, cart, gaming, IoT | Financial records, reporting | Semi-structured documents |

Section 07

Event-Driven Architecture — How Amazon's Services Talk

When a customer places an order on Amazon, a cascade of things must happen: inventory must be decremented, a shipping label must be generated, a confirmation email must be sent, the analytics pipeline must record the sale, and the recommendation engine must update its model. If all of this happened synchronously, with checkout waiting for each step in sequence, the customer could be left waiting many seconds for a confirmation.

💡
The Event-Driven Principle

Instead of Service A calling Service B calling Service C in a chain, Amazon's architecture uses events. Service A emits an event — "OrderPlaced" — and immediately returns to the user. Every downstream service that cares about orders subscribes to that event and processes it independently, asynchronously, at its own pace. No one waits for anyone.

📡 Order Placed: Event Fan-Out (animated diagram)
🛒 OrderPlaced (orderId: #A18-293847) → SNS Topic → SQS Fan-Out

| Subscriber | Action | SLA |
|---|---|---|
| 📦 Inventory Service | Decrement stock | < 500 ms |
| 🚚 Shipping Service | Generate label | < 2 s |
| 📧 Notification Svc | Send email/SMS | < 5 s |
| 📊 Analytics Service | Record in warehouse | < 30 s |
| 🎯 Recommendations | Update model | < 60 s |
| 💳 Fraud Detection | Flag if anomaly | < 200 ms |

All six services receive the same event concurrently. The user got their "Order confirmed" response before any of these downstream processes even started.
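A minimal in-memory sketch of this fan-out, assuming a toy `Topic` class that stands in for an SNS topic with one queue per subscriber. Nothing here calls AWS; the class, the `place_order` function, and the service names are hypothetical.

```python
from collections import deque


class Topic:
    """Toy stand-in for an SNS topic that copies each event into every queue."""

    def __init__(self):
        self._queues = {}

    def subscribe(self, service: str) -> deque:
        # Each subscriber gets its own queue, like one SQS queue per service.
        return self._queues.setdefault(service, deque())

    def publish(self, event: dict) -> None:
        for queue in self._queues.values():
            queue.append(dict(event))  # independent copy per subscriber


def place_order(topic: Topic, order_id: str) -> str:
    # Publish and return immediately; nobody waits for the consumers.
    topic.publish({"type": "OrderPlaced", "orderId": order_id})
    return "Order confirmed"
```

Because each subscriber drains its own queue, a slow Shipping consumer leaves its messages waiting without delaying Inventory or the customer's confirmation.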


Section 08

AWS Lambda & Serverless — The Infrastructure Vanishes

Lambda is the most radical idea in Amazon's architecture: you stop thinking about servers entirely. You write a function. AWS handles everything else — the server, the operating system, the runtime, the network, the scaling, the health checks, the patching.

Cold Start
First Invocation Latency
When a Lambda function is invoked for the first time (or after a period of inactivity), AWS must provision a compute environment, download your code, and initialise the runtime. This "cold start" typically adds a few hundred milliseconds, and more for heavy runtimes or large packages. Provisioned Concurrency avoids it by keeping pre-initialised environments ready.
📈
Scaling
Elastic Concurrent Execution
Lambda allows 1,000 concurrent executions per region by default (a soft limit that can be raised to tens of thousands on request). Each concurrent request gets its own isolated execution environment, so per-request performance is largely independent of load, cold starts aside.
💰
Pricing
Pay Per 1 ms of Execution
You pay per request plus the compute time you consume, metered in 1 ms increments and priced by the memory you allocate. A function that runs for 200 ms is billed for 200 ms of compute. When no functions are running, you pay nothing. A traditional EC2 instance costs money even when idle at 3 a.m.
⚡ Lambda Execution Lifecycle — Animated Step by Step
INIT
Cold Start — Environment Provisioning (Cold path only)
AWS spins up a new micro-VM (Firecracker), downloads your deployment package from S3, extracts it, and initialises the runtime (Node.js, Python, Java, etc). This happens once per container instance. Duration: 100–800 ms depending on runtime and package size.
CODE
Module Initialisation (Cold path only)
Your function's module-level code runs — database connection pools are created, configuration is loaded, heavy dependencies are required. This is the code outside your handler function. With Java or large Python packages, this can add 1–3 seconds.
RUN
Handler Execution (Every invocation)
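Your actual handler function runs. This is the only part that executes on every invocation, and its duration, metered in 1 ms increments, is the core of what you are billed for. Duration is measured until the handler returns (or times out); whether initialisation time is also billed depends on the function's configuration and AWS's current pricing rules.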
WARM
Container Kept Warm (Subsequent invocations)
After execution, AWS keeps the execution environment alive for a while; the exact period is not documented, but several minutes is commonly observed. The next invocation reuses the same environment, skipping the cold start entirely. Database connections established in the init phase are reused. This is the "warm start" path, which adds negligible overhead.
IDLE
Container Frozen & Recycled
After inactivity, AWS freezes and eventually recycles the container. No cost is incurred during this idle period. The next invocation will experience a cold start again. Solution: Provisioned Concurrency pre-initialises N containers permanently — eliminating cold starts at a fixed cost.

The cold start path (blue) only happens once per container. All subsequent invocations (amber → green) reuse the warm container and only execute your handler.
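The split between init-phase code and handler code looks like this in a Python function. It is a sketch under assumptions: the `_CONFIG` contents and the `cold_start` flag are invented for illustration, and the counter only distinguishes the first invocation in a given environment from later ones.

```python
# Module-level code: runs once per execution environment (the cold path).
# Expensive setup such as connection pools or config loading belongs here.
_CONFIG = {"table": "orders"}  # hypothetical configuration loaded at init
_invocations = 0


def handler(event, context=None):
    """Runs on every invocation and reuses whatever the init phase created."""
    global _invocations
    _invocations += 1
    return {
        # True only for the first call in this environment, i.e. the one
        # that paid for the init phase above.
        "cold_start": _invocations == 1,
        "table": _CONFIG["table"],
        "order_id": event.get("orderId"),
    }
```

Invoking `handler` twice in the same process reports `cold_start` as true only the first time, which is the behaviour a warm environment gives you.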


Section 09

Regions, Availability Zones & the Geography of Resilience

Why One Data Centre Is Never Enough
On 14 August 2003, a software bug in an energy management system triggered a cascade failure that blacked out the entire northeast United States and parts of Canada — 55 million people, 256 power plants, including data centres that had no idea the grid was about to disappear.

Amazon had experienced their own infrastructure catastrophes. They knew that any single physical location could fail for reasons completely outside their control: flooding, fires, power cuts, earthquakes, fibre cuts, hardware failures, or simply someone tripping over a cable.

Their answer: design for failure. Never deploy anything to a single physical location. Assume every component will fail, and architect accordingly.
🌏 AWS Region → Availability Zone → Data Centre Hierarchy (diagram)
🌍 Region: us-east-1 (N. Virginia), one of 33 AWS Regions globally
AZ-1a 📍: Data Centre A, Data Centre B
AZ-1b 📍: Data Centre C, Data Centre D
AZ-1c 📍: Data Centre E, Data Centre F
At a glance: 33 AWS Regions worldwide · 105+ Availability Zones · 600+ CloudFront Edge PoPs

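AZs within a region are physically separated by a meaningful distance (many kilometres, though all within roughly 100 km of each other) with independent power, cooling, and networking, and are connected by Amazon's private fibre with low single-digit-millisecond latency.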

Amazon's 99.99% Availability Target

99.99% availability means Amazon.com can be down for no more than 52.6 minutes per year. To achieve this, every critical service runs active replicas in at least 3 Availability Zones simultaneously. If one AZ fails completely (data centre flood, power failure), the remaining two AZs absorb all traffic within 60 seconds — with no data loss.
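The 52.6-minute figure is simple arithmetic, and the same arithmetic shows why multiple AZs help. The sketch below assumes a 365.25-day year and, for the second function, that replicas fail independently, which real AZs only approximate.

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # 525,960 minutes


def downtime_minutes_per_year(availability_pct: float) -> float:
    """Maximum yearly downtime allowed by an availability percentage."""
    return (1 - availability_pct / 100) * MINUTES_PER_YEAR


def parallel_availability(single: float, copies: int) -> float:
    """Availability when any one of `copies` independent replicas suffices."""
    return 1 - (1 - single) ** copies
```

`downtime_minutes_per_year(99.99)` comes to about 52.6 minutes, and three independent replicas that are each 99% available combine to roughly 99.9999%.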


Section 10

Amazon's Caching Strategy — The Speed Multiplier

Every millisecond of latency on Amazon.com costs real money. A widely cited internal Amazon finding is that every 100 ms of additional latency costs about 1% in sales. With hundreds of billions of dollars in annual revenue, each extra 100 ms would put billions of dollars a year at risk. Caching is not optional; it is survival.

L1
Browser Cache — 0 ms
Static assets (CSS, JavaScript, fonts, images) are served with aggressive Cache-Control: max-age=31536000 headers. Once a user has visited Amazon.com, their browser caches these assets for up to a year. Second visit loads the page with zero requests to Amazon's servers for static content.
L2
CloudFront Edge Cache — < 5 ms
Product images and pages are cached at the nearest CloudFront PoP. A user in Mumbai hits a CloudFront node in Mumbai — not servers in us-east-1. Cache TTL varies: product listings (5 minutes), product images (7 days), JavaScript bundles (1 year with cache-busting hashes).
L3
ElastiCache (Redis) — < 1 ms
Inside the data centre, frequently accessed data (session tokens, product details, cart contents, recommendation lists) lives in Redis. A Redis GET typically takes < 1 ms, several times faster than a single-digit-millisecond DynamoDB read and often one to two orders of magnitude faster than a PostgreSQL query that has to touch disk. Amazon uses Redis clusters with read replicas per AZ to survive zone failures.
L4
In-Process Memory Cache — < 0.01 ms
Hot configuration data (feature flags, A/B test assignments, service discovery endpoints) is cached in the application process's own memory. No network round-trip at all. Refreshed every 30–60 seconds from a central config store. The fastest cache is the one you never leave the CPU for.
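The L3 and L4 layers both follow the cache-aside pattern: look in the cache, fall back to the source of truth on a miss, then populate the cache with a TTL. Here is a self-contained sketch; `TTLCache`, `get_product`, and the injectable clock are illustrative names, not part of any AWS SDK.

```python
import time


class TTLCache:
    """Tiny in-process cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._data = {}

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._data[key]  # expired: treat as a miss
            return None
        return value

    def set(self, key, value):
        self._data[key] = (value, self._clock() + self._ttl)


def get_product(product_id, cache, load_from_db):
    """Cache-aside read: try the cache first, then the database."""
    cached = cache.get(product_id)
    if cached is not None:
        return cached
    value = load_from_db(product_id)  # slow path on a cache miss
    cache.set(product_id, value)
    return value
```

The clock is injectable so expiry can be tested without sleeping; in production code the default `time.monotonic` would be used.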

Section 11

Auto Scaling — Surviving Prime Day

Prime Day 2023: 375 Million Items Sold in 48 Hours
Prime Day is Amazon's self-created Black Friday: a 48-hour event where traffic spikes to 5–10× normal volume within minutes of the event starting. In 2023, Amazon reported selling more than 375 million items worldwide, third-party analysts (Adobe Analytics) estimated US online spending over the two days at $12.7 billion, and orders were fulfilled from a global network of 200+ fulfilment centres.

No human being added a single server during Prime Day. AWS Auto Scaling Groups monitored CPU utilisation, request count, queue depth, and network I/O in real-time — and automatically provisioned and deprovisioned capacity to match demand. The system went from "Tuesday afternoon quiet" to "planet-scale commercial event" and back — entirely automatically.
| Auto Scaling Signal | Threshold | Action | Response Time |
|---|---|---|---|
| CPU Utilisation | > 70% for 3 min | Add 2 instances | < 90 seconds |
| SQS Queue Depth | > 1,000 messages | Scale consumers 2× | < 60 seconds |
| ALB Request Rate | > 10,000 req/s | Pre-warm new AZ | < 3 minutes |
| Memory Pressure | > 85% RAM | Evict cache + scale | < 30 seconds |
| Scheduled Scaling | Prime Day T−2h | Pre-scale to 3× base | Instant (planned) |
| CPU Utilisation | < 30% for 10 min | Terminate excess instances | < 5 minutes |
🎯
Predictive vs Reactive Scaling

Amazon uses both: Reactive scaling responds to real-time metrics as they cross thresholds. Predictive scaling uses ML to forecast traffic patterns based on historical data (same day last year, same event type, same time of day) and pre-provisions capacity before the demand spike arrives — because reactive scaling has a 90-second lag that can cause brief outages during instantaneous traffic spikes.
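The reactive half of this can be reduced to a small decision function. The sketch below reuses the CPU and queue-depth thresholds quoted in this section but ignores the "sustained for N minutes" conditions. The `Metrics` type, the size limits, and the one-instance scale-in step are assumptions, not AWS Auto Scaling's actual policy engine.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    cpu_pct: float
    queue_depth: int


def desired_capacity(current: int, m: Metrics,
                     min_size: int = 2, max_size: int = 100) -> int:
    """Pick a new instance count from the current metrics."""
    if m.queue_depth > 1000:
        target = current * 2      # backlog building: double the consumers
    elif m.cpu_pct > 70:
        target = current + 2      # hot: add two instances
    elif m.cpu_pct < 30:
        target = current - 1      # idle: scale in gently
    else:
        target = current          # within the comfortable band
    # Never go below the floor or above the ceiling.
    return max(min_size, min(max_size, target))
```

Clamping to `min_size` and `max_size` is what stops a noisy metric from scaling the fleet to zero or to an unbounded bill.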


Section 12

Amazon's Security Architecture — Defence in Depth

Amazon operates under a Shared Responsibility Model: AWS secures the infrastructure, customers secure what they put on it. For Amazon's own retail application, every layer of the architecture has its own security controls — no single layer is trusted completely.

🔐
IAM — Identity & Access Management
Zero Trust Principals
Every AWS API call is authenticated and authorised via IAM. Lambda functions, ECS tasks, and EC2 instances assume IAM Roles with the minimum permissions needed. A compromised Cart service cannot read the Payments database — their IAM Roles are completely separate.
🛡️
AWS Shield Advanced
DDoS Protection
AWS Shield Advanced provides automatic DDoS mitigation at the network layer (L3/L4) and application layer (L7). Amazon's retail site absorbs multi-hundred-Gbps DDoS attacks routinely — the attack traffic is absorbed at the edge before it reaches any origin server.
🔑
KMS — Key Management Service
Encryption Everywhere
Every S3 bucket, DynamoDB table, RDS database, EBS volume, and SQS queue is encrypted at rest using KMS-managed keys. Every service-to-service call travels over TLS 1.3. Customer payment data is additionally encrypted with PCI-DSS compliant tokenisation before touching any database.

Section 13

Observability — Knowing What Is Happening at All Times

With thousands of microservices running simultaneously, "the site feels slow" is not a useful incident report. Amazon's engineering culture demands observable, measurable, attributable behaviour from every system.

📊 Amazon's Three Pillars of Observability (Plus Alarms)
Metrics
Every service emits standardised CloudWatch metrics: request count, error rate, latency (p50/p95/p99), queue depth, cache hit rate. Dashboard alarms fire when any metric crosses a threshold. Prime Day dashboards are reviewed by 100+ engineers watching real-time panels during the event.
Logs
Every request is logged with a unique correlation ID (X-Amzn-Trace-Id) that flows through every service the request touches. CloudWatch Logs Insights lets engineers run queries in a purpose-built, SQL-like query language across very large volumes of log data: "Show me all orders that touched the Payments service and had latency > 2s in the last 5 minutes."
Traces
AWS X-Ray provides distributed tracing: a visual flame chart of exactly which service, which function, and which database query added how much latency to a specific request. Finding the slow node in a chain of 12 microservices takes seconds, not hours.
Alarms
CloudWatch Alarms trigger PagerDuty → on-call engineer when error rates exceed SLOs. Amazon's target: every P1 incident is acknowledged within 5 minutes and has a workaround deployed within 30 minutes, 24/7/365.
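Percentile latency metrics such as p99 and the alarms built on them are easy to compute by hand, which helps when reading a dashboard. This sketch uses the nearest-rank definition of a percentile; CloudWatch's own statistics may be computed differently, and the `alarm_state` function and its threshold are hypothetical.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the value at rank ceil(pct% of n)."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def alarm_state(latencies_ms, p99_threshold_ms):
    """Return 'ALARM' when p99 latency breaches the threshold, else 'OK'."""
    if percentile(latencies_ms, 99) > p99_threshold_ms:
        return "ALARM"
    return "OK"
```

A handful of very slow requests barely moves the mean but pushes p99 over the threshold, which is why tail percentiles rather than averages are alarmed on.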

Section 14

"Customers Who Bought This…" — The Recommendation Architecture

The Algorithm That Makes Amazon Billions
Amazon's recommendation engine, the "Customers who bought this also bought…" and "You might also like…" features, has been estimated (most often citing McKinsey) to drive about 35% of what customers buy on Amazon. This is not a side feature. It is the central nervous system of Amazon's retail business.

The original algorithm, developed at Amazon in 1998, patented in 2001, and described by Greg Linden, Brent Smith, and Jeremy York in a 2003 paper, is called item-to-item collaborative filtering. Instead of finding similar users (computationally expensive at Amazon scale), it pre-computes item-to-item similarity scores offline, stores them in a fast lookup table, and serves them in milliseconds during page render.
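The offline/online split can be shown with a toy version of item-to-item collaborative filtering. This is a simplified sketch, not Amazon's production algorithm: it scores item pairs by cosine similarity over co-purchase counts, and the basket data, function names, and tie-breaking rule are assumptions.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def build_similarity(baskets):
    """Offline step: cosine similarity between items from co-purchases."""
    item_count = defaultdict(int)
    pair_count = defaultdict(int)
    for basket in baskets:
        items = sorted(set(basket))
        for item in items:
            item_count[item] += 1
        for a, b in combinations(items, 2):
            pair_count[(a, b)] += 1
    table = defaultdict(dict)
    for (a, b), together in pair_count.items():
        score = together / sqrt(item_count[a] * item_count[b])
        table[a][b] = score
        table[b][a] = score
    return table


def also_bought(table, item, top_n=3):
    """Online step: a dictionary lookup plus a sort of a short list."""
    neighbours = table.get(item, {})
    ranked = sorted(neighbours, key=lambda other: (-neighbours[other], other))
    return ranked[:top_n]
```

All the expensive counting happens in `build_similarity`; serving a page only needs `also_bought`, which never scans the purchase history.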
🔄
Offline Training
Batch (every few hours)
Amazon SageMaker trains recommendation models on a rolling window of purchase and browsing history. Item-to-item similarity matrices are computed using collaborative filtering, content-based features, and deep learning embeddings. Training jobs run on spot instances for cost efficiency.
Online Serving
Real-time (< 30 ms)
The trained model is deployed behind a SageMaker Endpoint. When a product page loads, a feature vector (item ID + user context) is sent to the endpoint, which returns the top-N recommendations. The entire inference pipeline — feature lookup, model call, result ranking — takes < 30 ms.
🧪
A/B Testing
Always Experimenting
Amazon runs thousands of simultaneous A/B tests. Each user is assigned to experiment buckets via an experiment assignment service. Algorithm variants A, B, and C are compared on click-through rate, add-to-cart rate, and final purchase conversion, not just impressions. The winner is rolled out once the result reaches statistical significance.

Section 15

The Physical Architecture — Fulfilment Centres & Last-Mile Delivery

Amazon's architecture is not just software. The physical fulfilment network is one of the most sophisticated logistics systems ever built — and its design mirrors the same principles as the software architecture: distributed, redundant, independently scalable, and event-driven.

| Facility Type | Size | Purpose | Count (approx.) |
|---|---|---|---|
| Fulfilment Centre (FC) | 500,000–1M sq ft | Receive, store, pick, pack, ship large items | 200+ worldwide |
| Sortation Centre | 200,000–400,000 sq ft | Sort packages by delivery route/carrier | 50+ |
| Delivery Station | 50,000–100,000 sq ft | Last-mile: packages sorted for individual drivers | 1,000+ (US alone) |
| Prime Now Hub | 10,000–30,000 sq ft | Ultra-fast 1–2 hour delivery of top-SKU items | 30+ cities |
| Amazon Air Hub | Custom airport facility | Dedicated air cargo for same-day/next-day delivery | 2 primary (CVG, AFW) |
🤖
Amazon Robotics — Kiva to Proteus

In 2012, Amazon acquired Kiva Systems (robot manufacturer) for $775 million. Today, Amazon reports more than 750,000 robots working across its operations, many of them drive units descended from Kiva's design. Instead of workers walking kilometres of aisles to pick items, robots bring the shelving pods to the workers, reportedly cutting average pick time per item from 15 minutes to 15 seconds. The newest generation, Proteus, is Amazon's first fully autonomous mobile robot, operating freely alongside human workers without safety cages.


Section 16

Key Architectural Patterns — The Reusable Blueprints

🔁
CQRS
Command Query Responsibility Segregation
Amazon separates write operations (commands — "place order") from read operations (queries — "get order list"). Writes go to a strongly-consistent DynamoDB table. Reads are served from a denormalised, pre-aggregated read model in ElastiCache. Different data stores, different scaling characteristics, same business data.
📋
Event Sourcing
Immutable Event Log
Order state is never updated in-place. Instead, every state change is appended as an immutable event: "OrderCreated", "PaymentReceived", "ShipmentDispatched", "DeliveryConfirmed". The current state is derived by replaying events. Enables full audit trail, temporal queries, and event replay for recovery.
🔒
Saga Pattern
Distributed Transactions
A checkout saga coordinates a distributed transaction across 5 services: Reserve Inventory → Charge Payment → Confirm Reservation → Create Shipment → Send Confirmation. Each step publishes a success or failure event. If any step fails, compensating transactions undo prior steps. No two-phase commit, no distributed locks.
🔄
Circuit Breaker
Failure Isolation
When the Recommendations service starts returning errors or timing out, the Circuit Breaker trips open: subsequent calls immediately return a fallback (empty recommendations) without waiting. The slow service cannot cascade its failure downstream. After 30 seconds, the breaker tries a probe request — if it succeeds, closes again.
🎯
Strangler Fig
Incremental Migration
How Amazon migrated from monolith to microservices without a big-bang rewrite: they built new services beside the monolith, then routed specific URL paths to the new service. Gradually, the monolith was "strangled", one route at a time, until it was gone. No downtime. No big-bang rewrite. Just progressive replacement.
📫
Outbox Pattern
Guaranteed Event Delivery
When an Order service writes to its database, it also writes an event record to an "outbox" table in the same database transaction. A separate poller reads the outbox and publishes to SNS/SQS. Even if the app crashes after the DB write but before the publish, the event is still in the outbox and will be published on the next poll. This gives at-least-once delivery, so consumers must be idempotent (for example, by de-duplicating on an event ID) to get effectively-once processing.
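Of the patterns above, the Circuit Breaker is the easiest to show in a few lines. The sketch below is a minimal single-threaded version with closed, open, and half-open behaviour; the class name, the failure threshold of three, and the injectable clock are assumptions, and the 30-second reset matches the figure quoted above.

```python
import time


class CircuitBreaker:
    """Fail fast while a dependency is unhealthy, then probe it again."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the breaker is closed

    def call(self, func, fallback):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                return fallback()  # open: skip the call entirely
            # Timeout elapsed: half-open, let this one call through as a probe.
        try:
            result = func()
        except Exception:
            self._failures += 1
            probing = self._opened_at is not None
            if probing or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open (or re-open)
            return fallback()
        self._failures = 0
        self._opened_at = None  # a success closes the breaker
        return result
```

A caller would wrap the risky dependency, for example `breaker.call(fetch_recommendations, list)`, where `fetch_recommendations` is a hypothetical function and `list` supplies the empty fallback.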

Section 17

Benefits, Use Cases & Trade-offs

| Architectural Decision | Benefit | Trade-off | When to Apply |
|---|---|---|---|
| Microservices | Independent deploy, scale, fault isolation | Distributed system complexity, network overhead | Teams > 50 engineers, > 5 independent domains |
| DynamoDB (NoSQL) | Near-unlimited scale, single-digit-ms latency, minimal ops | Limited ad-hoc queries; access patterns must be planned up front | Session, cart, catalogue, IoT, gaming leaderboards |
| Event-Driven (SQS/SNS) | Decoupling, async, resilience, fan-out | Eventual consistency, harder to debug | Any workflow with 3+ downstream side effects |
| Lambda (Serverless) | No servers to manage, auto-scale, pay-per-use | Cold starts, 15-min timeout, no durable local state | Event processing, APIs, scheduled jobs, webhooks |
| Multi-AZ Deployment | High availability (99.99% target), automatic failover | 2–3× cost, cross-AZ data transfer fees | All production services with SLA commitments |
| ElastiCache (Redis) | Sub-ms reads, far faster than a database query | Cache invalidation complexity, extra cost | Session tokens, hot product data, leaderboards |
| CloudFront CDN | Global low latency, DDoS absorption | Stale content until TTL expiry or invalidation (default TTL 24 h) | Static assets, product images, public API responses |
| Saga Pattern | Distributed transactions without locking | Complex compensating logic, eventual consistency | Multi-service business transactions (checkout, booking) |
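The "complex compensating logic" in the Saga row is easier to judge with a concrete shape in mind. This is a minimal orchestration-style sketch that runs in one process; a real saga would persist its progress and drive the steps through events. The `run_saga` function and the step names are hypothetical.

```python
def run_saga(steps):
    """Run (name, action, compensation) steps in order.

    If an action raises, the compensations of all completed steps are
    executed in reverse order and the failure is reported.
    """
    completed = []
    for name, action, compensate in steps:
        try:
            action()
        except Exception as exc:
            for _done_name, undo in reversed(completed):
                undo()  # compensate the most recent step first
            return {"status": "rolled_back", "failed_step": name,
                    "error": str(exc)}
        completed.append((name, compensate))
    return {"status": "completed", "failed_step": None, "error": None}
```

If the payment step fails after inventory was reserved, only the reservation is released; steps that never ran have nothing to undo.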

Section 18

Golden Rules — Amazon's Architecture Principles

🏆 Amazon Architecture — Non-Negotiable Principles
1
Design for failure, not uptime. Assume every component — every server, every AZ, every third-party API — will fail. Build so that when (not if) it fails, the rest of the system degrades gracefully and continues serving customers, even at reduced capability.
2
API-first, always. No service may access another service's data store directly. All cross-service communication happens exclusively through versioned API contracts. This is the single most important rule for long-term architectural health.
3
Decouple with queues, not direct calls. Whenever a service produces a side effect that another service must handle, publish an event to a queue. Never make a synchronous call to a service whose failure or slowness you cannot tolerate.
4
Measure everything, trust nothing. Every service must emit metrics, logs, and traces. If you cannot measure a behaviour, you cannot own it. On-call engineers must be able to diagnose any incident purely from observability data — without asking anyone else.
5
The team that builds it runs it. No separate "operations" team. The engineering team that built the service owns its on-call, its SLAs, and its incident response. This alignment makes engineers care deeply about operational quality — because they are woken up at 3 a.m. when it breaks.
6
Prefer managed services over self-managed infrastructure. Running your own Kafka cluster, your own Kubernetes control plane, or your own MySQL primary-replica setup is undifferentiated heavy lifting. Use SQS, EKS, and RDS instead. Spend engineering time on features, not infrastructure management.
7
Always be experimenting. No architectural decision is permanent. Every pattern, every data store choice, every service boundary should be re-evaluated as the system grows. The strangler fig pattern exists precisely because perfect decisions made at 10 engineers become wrong decisions at 1,000 engineers.