Amazon Architecture: AWS, DynamoDB & Microservices

Section 01

The Story That Explains Amazon's Architecture

📖 Real World Analogy

From a Single Garage to a City of a Million Warehouses

Imagine you open a bookshop in your garage. You do everything yourself — you take orders, pack boxes, handle returns, manage the accounts, and deliver on weekends. Business is good, so you hire a few people. Then a few more. One day you wake up and you have 10,000 employees, warehouses on five continents, and tens of millions of customers browsing simultaneously at 3 a.m.

The systems that worked in your garage — one spreadsheet, one phone, one shelf — collapse instantly at this scale. You need a completely different way of thinking about infrastructure, teams, and services.

That is exactly the journey Amazon made from 1994 to today. And the architectural decisions they were forced to make along the way became the blueprint that every large tech company now copies.

Amazon did not start with a clean-room architectural vision. They started with a monolith — one giant application where everything was tangled together. Every feature, every team, and every database was woven into a single codebase. Deploying a change to the homepage required the entire company to hold its breath.

⚠️

The Monolith Problem at Scale

By the early 2000s, Amazon's monolith had grown so large that a bug in the Wish List feature could take down the Checkout service. A slow database query in Reviews could delay order fulfilment. Every team was blocked by every other team. The system was a ticking time bomb — and Black Friday was approaching.

Section 02

Amazon's Architecture — The 30,000-Foot View

Amazon's architecture is best understood as three concentric rings, each solving a different problem at a different layer.

🏗️ Amazon Three-Ring Architecture (diagram)

Rings, from outermost to innermost: AWS Infrastructure, Platform Microservices, Core Domain Services.

AWS services shown: CloudFront CDN, S3 Storage, Lambda, SQS / SNS, IAM / Shield, RDS / DynamoDB.

Services shown: Order, Search, Cart, Payment, User.

Each ring is independently scalable. A spike in Search traffic does not affect Payments. A DynamoDB update does not require redeploying the Cart service.

Section 03

The Two-Pizza Team Rule & Service-Oriented Architecture

📖 Story — The Jeff Bezos Mandate

"All teams will henceforth expose their data and functionality through service interfaces."

Around 2002, Jeff Bezos sent an internal memo that became legendary in software engineering circles. The memo had six points — the last of which simply said: "Anyone who doesn't do this will be fired."

The mandate was brutally simple: every team must expose its data through a service interface. No direct database calls from other teams. No shared memory. If you want another team's data, you call their API. Period.

This single memo restructured how Amazon built software — and accidentally invented the approach that would later power AWS, the most profitable cloud business in history.

Bezos also coined the Two-Pizza Rule: if a team cannot be fed with two pizzas, it is too large. Small teams own small services. Small services can be deployed, scaled, and killed independently.

🍕

Two-Pizza Teams

5–8 Engineers Max

Each team owns exactly one service end-to-end: design, build, deploy, monitor, and on-call. No committee approvals. No cross-team deployment gates. You build it, you run it.

🔌

API Contracts

The Only Interface

Teams communicate exclusively through well-defined API contracts. The implementation behind the API can change at any time, as long as the contract holds. Complete internal freedom, enforced external stability.

🧱

Service Isolation

Blast Radius Containment

When a service fails, only that service fails. The surrounding services degrade gracefully or fall back. The failure does not cascade. This is the entire point of the architecture — resilience through isolation.

⚡ Monolith vs Service-Oriented Architecture (diagram)

❌ Monolith (2001): Cart + Checkout, Search + Catalog, User + Auth, Payments + Billing, Inventory + Shipping, and Reviews + Recs all share ONE database. One deploy puts the entire site at risk.

✅ Service-Oriented (2004+): separate Cart, Search, User, Pay, Inventory, and Recs services, connected by an event bus (SQS / SNS / EventBridge). Each service deploys independently.

In the monolith, stress on any single module shakes the whole system. In the service-oriented design each service is isolated: stress on Cart does not ripple to Payments.

Section 04

AWS — The Product That Came From Internal Pain

📖 Story — Pain Becomes Product

Amazon Accidentally Built the World's Biggest Cloud

Between 2000 and 2003, Amazon's engineering teams were drowning in undifferentiated heavy lifting. Every new service had to provision servers, configure networking, manage storage, and handle security from scratch. It was the same painful work every time.

They built internal tools to solve this. Then they standardised those tools into platforms. Then someone had an idea: what if we sell access to these platforms?

In 2006, Amazon launched S3 and EC2, and AWS as we know it was born. Today AWS generates roughly $100 billion in annual revenue and, in recent years, well over half of Amazon's total operating profit. The cloud business Amazon built to survive its own growing pains now earns more profit than the retail business it was built to support.

📅

2006 — S3 & EC2

Cloud Storage + Compute

Amazon launches Simple Storage Service (S3) and Elastic Compute Cloud (EC2). The concept of renting elastic compute by the hour redefines the entire industry.

📅

2009 — RDS & VPC

Managed Databases + Networking

Relational Database Service removes the operational burden of running databases. Virtual Private Cloud lets enterprises isolate their AWS workloads like a private data centre.

📅

2014 — Lambda

Serverless Compute

AWS Lambda introduces the concept of functions-as-a-service. You write code. AWS handles every other aspect of its execution — servers, scaling, patching, billing per millisecond.

Section 05

Core AWS Services — What They Do & Why They Exist

🌐 A Request's Journey Through Amazon's Infrastructure

🌍

Step 1 — Route 53 (DNS)

A customer types amazon.com. Route 53 receives this DNS query and resolves it to the nearest healthy endpoint using latency-based routing. If an entire AWS region goes down, Route 53 automatically fails over to the next healthy region within 60 seconds.

🛡️

Step 2 — CloudFront + WAF (Edge)

The request hits Amazon's global edge network: 600+ Points of Presence across 100+ cities in roughly 50 countries, by AWS's published counts. CloudFront serves cached static assets (images, JS, CSS) from the edge node nearest to the user. AWS WAF inspects every request for SQL injection, XSS, and known attack patterns before anything reaches the origin.

⚖️

Step 3 — Application Load Balancer

Dynamic requests pass through an Application Load Balancer (ALB). The ALB routes by URL path (/search/* → Search service, /cart/* → Cart service, /checkout/* → Checkout service). It health-checks every backend target at a configurable interval (30 seconds by default) and stops routing to any target that fails a configured number of consecutive checks.

📦

Step 4 — ECS / EKS (Containers)

The request lands on a containerised microservice running in either Elastic Container Service (ECS) or Elastic Kubernetes Service (EKS). Each microservice runs in its own containers, with its own memory, CPU, and deployment lifecycle. Service-level auto scaling (ECS Service Auto Scaling or the Kubernetes Horizontal Pod Autoscaler) launches more containers when, for example, CPU exceeds 70% or queue depth rises, while EC2 Auto Scaling Groups add the underlying instances.

💾

Step 5 — ElastiCache + DynamoDB / RDS

The service first checks ElastiCache (Redis) for a cached result. Cache hit → return in <1 ms. Cache miss → query DynamoDB (NoSQL, single-digit-millisecond latency at any scale) or Aurora / RDS (relational, for transactional data). DynamoDB alone has peaked at more than 100 million requests per second during Prime Day.

🔁

Step 6 — SQS / SNS / EventBridge (Async)

After the synchronous response is returned to the user, asynchronous side-effects are published as events: "order placed" → SQS queue consumed by Inventory, Shipping, Notifications, Analytics services. No tight coupling. Each downstream service processes at its own pace. If one is slow, the queue buffers without losing a single event.

Each step is handled by a separate, independently scalable AWS service. A bottleneck at Step 5 never affects Step 2.
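The path-based routing in Step 3 can be sketched in a few lines. This is only an illustration of the idea, not how the ALB is implemented: the routing table, the service names, and the `healthy` set are hypothetical stand-ins for listener rules and health-check results.

```python
from typing import Optional

# Hypothetical routing table mirroring the path rules described in Step 3.
ROUTES = {
    "/search/": "search-service",
    "/cart/": "cart-service",
    "/checkout/": "checkout-service",
}


def route(path: str, healthy: set[str]) -> Optional[str]:
    """Return the target service for a path, or None if no healthy target matches."""
    for prefix, service in ROUTES.items():
        # Only route to services that currently pass their health checks.
        if path.startswith(prefix) and service in healthy:
            return service
    return None


print(route("/cart/items", {"cart-service", "search-service"}))
```

A target that fails its health checks simply drops out of the `healthy` set, so requests for its path stop being sent to it.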

Section 06

DynamoDB — Amazon's Most Important Database Innovation

Amazon needed a database that could handle tens of millions of peak requests per second with single-digit millisecond latency — globally, without downtime. No existing database came close. So they built one.

📖 Story — The 2007 Paper That Changed Databases

Dynamo: Amazon's Highly Available Key-Value Store

Amazon's internal infrastructure team published a paper in 2007 describing the design of Dynamo, their internal key-value store. The paper introduced concepts — consistent hashing, vector clocks, sloppy quorums — that became the DNA of an entire generation of NoSQL databases including Cassandra, Riak, and eventually the public DynamoDB released in 2012.

The core insight: for shopping cart and session data, it is acceptable for different servers to hold slightly different views of the data for a short time, as long as the system is always available and never blocks writes. Availability trumps consistency for many e-commerce workloads. This is the choice the CAP theorem forces during a network partition, and Dynamo's decision to favour availability was highly influential.

🏗️ How DynamoDB Handles a Write at Planet Scale

Step 1

Partition Key Hashing: The partition key (e.g., userId#42) is hashed via a consistent hash function to determine which storage node owns this data.

Step 2

Leader Write: The write goes to the designated leader replica for that partition. The leader writes to its local storage and simultaneously replicates to two follower replicas across different AZs.

Step 3

Quorum ACK: Once 2 of 3 replicas confirm the write, success is returned to the client. This is quorum — tolerating one AZ failure without data loss or blocking.

Step 4

Global Tables: If Global Tables is enabled, the write is asynchronously replicated to all other configured regions (e.g., from us-east-1 to eu-west-1 and ap-southeast-1), typically within seconds. Active-active multi-region with last-writer-wins conflict resolution.

Result

End-to-end write latency: single-digit milliseconds. Amazon's Prime Day 2022: trillions of DynamoDB API calls, peaking at 105.2 million requests per second.
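Steps 1 to 3 can be sketched as a toy hash ring plus a quorum check. This is a simplified illustration under stated assumptions: DynamoDB's internal hash function and partition map are not public, so MD5, the virtual-node count, and the node names below are stand-ins.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    # MD5 is an assumption used only to spread keys evenly over the ring.
    return int(hashlib.md5(key.encode()).hexdigest(), 16)


class HashRing:
    """Consistent hash ring with virtual nodes."""

    def __init__(self, nodes, vnodes=64):
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def owner(self, key: str) -> str:
        # Walk clockwise to the first virtual node at or after the key's hash.
        idx = bisect.bisect(self._points, _hash(key)) % len(self._ring)
        return self._ring[idx][1]


def quorum_write(replica_acks, quorum=2):
    """Return True once at least `quorum` replicas acknowledged the write."""
    return sum(1 for ok in replica_acks if ok) >= quorum


ring = HashRing(["node-a", "node-b", "node-c"])
print(ring.owner("userId#42"), quorum_write([True, True, False]))
```

With three replicas and a quorum of two, one replica (or one AZ) can be unavailable and the write still succeeds.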

Feature	DynamoDB	MySQL / PostgreSQL	MongoDB
Scaling Model	Horizontal (automatic partitioning)	Mostly vertical (plus read replicas)	Horizontal (sharding on a chosen shard key)
Peak RPS	No practical ceiling (partitions added automatically)	Thousands (tuned)	Hundreds of thousands
Latency (p99)	< 5 ms	10–100 ms	5–50 ms
Operational Overhead	Low (fully managed)	High (patches, vacuums, failover)	Medium
ACID Transactions	Yes (up to 100 items per transaction)	Full ACID	Yes (multi-document since 4.0)
Best For	Session, cart, gaming, IoT	Financial records, reporting	Semi-structured documents

Section 07

Event-Driven Architecture — How Amazon's Services Talk

When a customer places an order on Amazon, a cascade of things must happen: inventory must be decremented, a shipping label must be generated, a confirmation email must be sent, the analytics pipeline must record the sale, the recommendation engine must update the model. If all of this happened synchronously — waiting for each step in sequence — checkout would take 30 seconds.

💡

The Event-Driven Principle

Instead of Service A calling Service B calling Service C in a chain, Amazon's architecture uses events. Service A emits an event — "OrderPlaced" — and immediately returns to the user. Every downstream service that cares about orders subscribes to that event and processes it independently, asynchronously, at its own pace. No one waits for anyone.

📡 Order Placed: Event Fan-Out (diagram)

🛒 OrderPlaced (orderId: #A18-293847) → SNS Topic → SQS Fan-Out

Consumer	Action	SLA
📦 Inventory Service	Decrement stock	< 500 ms
🚚 Shipping Service	Generate label	< 2 s
📧 Notification Svc	Send email/SMS	< 5 s
📊 Analytics Service	Record in warehouse	< 30 s
🎯 Recommendations	Update model	< 60 s
💳 Fraud Detection	Flag if anomaly	< 200 ms

All six services receive the same event concurrently. The user got their "Order confirmed" response before any of these downstream processes even started.
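The fan-out can be sketched with an in-memory stand-in for an SNS topic feeding one queue per subscriber. The `Topic` class and the subscriber names are hypothetical and exist only to show the shape of the pattern; a real deployment would use SNS subscriptions and SQS queues.

```python
from collections import defaultdict, deque


class Topic:
    """In-memory stand-in for an SNS topic that fans out to per-subscriber queues."""

    def __init__(self):
        self.queues = defaultdict(deque)

    def subscribe(self, name):
        self.queues[name]  # touching the key registers an empty queue

    def publish(self, event):
        for queue in self.queues.values():
            queue.append(dict(event))  # each subscriber gets its own copy


topic = Topic()
for service in ("inventory", "shipping", "notifications"):
    topic.subscribe(service)

# The publisher returns immediately; consumers drain their queues at their own pace.
topic.publish({"type": "OrderPlaced", "orderId": "A18-293847"})
print({name: len(queue) for name, queue in topic.queues.items()})
```

Because every subscriber has its own queue, a slow consumer only grows its own backlog and never blocks the publisher or the other consumers.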

Section 08

AWS Lambda & Serverless — The Infrastructure Vanishes

Lambda is the most radical idea in Amazon's architecture: you stop thinking about servers entirely. You write a function. AWS handles everything else — the server, the operating system, the runtime, the network, the scaling, the health checks, the patching.

⚡

Cold Start

First Invocation Latency

When a Lambda function is invoked for the first time (or after a period of inactivity), AWS must provision a compute environment, download your code, and initialise the runtime. This "cold start" typically adds anywhere from about 100 ms to over a second, depending on runtime and package size. Provisioned Concurrency avoids it by keeping execution environments initialised.

📈

Scaling

Automatic Concurrent Execution

Lambda allows 1,000 concurrent executions per region per account by default (a soft limit that can be raised to tens of thousands on request). Each concurrent request gets its own isolated execution environment, so per-request performance at 1 request and at 1 million requests is broadly the same, apart from cold starts.

💰

Pricing

Pay Per 1 ms of Execution

You pay only for the compute time you consume, rounded up to the nearest millisecond and priced according to the memory you configure. A function that runs for 200 ms is billed for exactly 200 ms of compute. When no functions are running, you pay nothing. A traditional EC2 instance costs money even when idle at 3 a.m.
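A rough cost estimate follows directly from the pricing model. The sketch below assumes illustrative list prices (per GB-second and per million requests) passed in as parameters; they are assumptions that vary by region and architecture, change over time, and ignore the free tier.

```python
import math


def lambda_cost(invocations, duration_ms, memory_mb,
                price_per_gb_second=0.0000166667,
                price_per_million_requests=0.20):
    """Estimate cost in USD. The default prices are illustrative assumptions."""
    billed_ms = math.ceil(duration_ms)  # duration is rounded up to the next ms
    gb_seconds = invocations * (billed_ms / 1000) * (memory_mb / 1024)
    request_cost = invocations / 1_000_000 * price_per_million_requests
    return gb_seconds * price_per_gb_second + request_cost


# One million invocations of a 200 ms function configured with 512 MB.
print(round(lambda_cost(1_000_000, 200, 512), 2))
```

Halving either the duration or the configured memory halves the compute part of the bill, which is why trimming handler time pays off directly.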

⚡ Lambda Execution Lifecycle — Animated Step by Step

INIT

Cold Start — Environment Provisioning (Cold path only)

AWS spins up a new micro-VM (Firecracker), downloads your deployment package from S3, extracts it, and initialises the runtime (Node.js, Python, Java, etc). This happens once per container instance. Duration: 100–800 ms depending on runtime and package size.

CODE

Module Initialisation (Cold path only)

Your function's module-level code runs — database connection pools are created, configuration is loaded, heavy dependencies are required. This is the code outside your handler function. With Java or large Python packages, this can add 1–3 seconds.

RUN

Handler Execution (Every invocation)

Your actual handler function runs. This is the only part that executes on every invocation, and its duration is billed in 1 ms increments until it returns (or times out). Whether the initialisation phases above are also billed depends on the function's configuration and AWS's current pricing rules.

WARM

Container Kept Warm (Subsequent invocations)

After execution, AWS keeps the container alive for a while, typically a matter of minutes (the exact period is not documented or guaranteed). The next invocation reuses the same container, skipping the cold start entirely. Database connections established in the init phase are reused. This is the "warm start" path, which adds almost no overhead.

IDLE

Container Frozen & Recycled

After inactivity, AWS freezes and eventually recycles the container. No cost is incurred during this idle period. The next invocation will experience a cold start again. Solution: Provisioned Concurrency pre-initialises N containers permanently — eliminating cold starts at a fixed cost.

The cold start path only happens once per container. All subsequent invocations reuse the warm container and only execute your handler.
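The split between init-phase code and handler code looks like this in a Python function. It is a minimal sketch: `_expensive_init`, `INIT_COUNT`, and the placeholder connection are hypothetical names used to make the once-per-environment behaviour visible.

```python
INIT_COUNT = 0


def _expensive_init():
    """Stand-in for opening connection pools and loading configuration."""
    global INIT_COUNT
    INIT_COUNT += 1
    return {"db": "connection-placeholder"}


# Module-level code: runs once per execution environment (the cold path).
RESOURCES = _expensive_init()


def handler(event, context=None):
    # Runs on every invocation and reuses RESOURCES created during init.
    return {"orderId": event["orderId"], "db": RESOURCES["db"]}


# Two invocations in the same (warm) environment share one initialisation.
handler({"orderId": "1"})
handler({"orderId": "2"})
print(INIT_COUNT)
```

Keeping expensive setup at module level is what makes warm invocations cheap; anything placed inside `handler` is paid for on every call.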

Section 09

Regions, Availability Zones & the Geography of Resilience

📖 Story — The Great Northeast Blackout

Why One Data Centre Is Never Enough

On 14 August 2003, a software bug in an energy management system triggered a cascade failure that blacked out the entire northeast United States and parts of Canada — 55 million people, 256 power plants, including data centres that had no idea the grid was about to disappear.

Amazon had experienced their own infrastructure catastrophes. They knew that any single physical location could fail for reasons completely outside their control: flooding, fires, power cuts, earthquakes, fibre cuts, hardware failures, or simply someone tripping over a cable.

Their answer: design for failure. Never deploy anything to a single physical location. Assume every component will fail, and architect accordingly.

🌏 AWS Region → Availability Zone → Data Centre Hierarchy (diagram)

🌍 Region: us-east-1 (N. Virginia), one of 33 AWS Regions globally

AZ-1a 📍: Data Centre A, Data Centre B
AZ-1b 📍: Data Centre C, Data Centre D
AZ-1c 📍: Data Centre E, Data Centre F

33 AWS Regions worldwide · 105+ Availability Zones · 600+ CloudFront Edge PoPs

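AZs within a region are physically separated by a meaningful distance (many kilometres, though all within about 100 km of each other) with independent power, cooling, and networking, and are connected by Amazon's private fibre with low single-digit-millisecond latency.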

✅

Amazon's 99.99% Availability Target

99.99% availability means Amazon.com can be down for no more than 52.6 minutes per year. To achieve this, every critical service runs active replicas in at least 3 Availability Zones simultaneously. If one AZ fails completely (data centre flood, power failure), the remaining two AZs absorb all traffic within 60 seconds — with no data loss.
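The downtime figure is simple arithmetic, and the same arithmetic shows why replicas in several AZs help. The second function assumes zone failures are independent and that one surviving zone is enough to serve traffic, which is an idealisation; the 99% single-zone figure is purely illustrative.

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # 525,960 minutes


def downtime_budget_minutes(availability):
    """Minutes per year a service may be down at the given availability."""
    return (1 - availability) * MINUTES_PER_YEAR


def combined_availability(single_zone, zones):
    """Availability if the service is up whenever at least one zone is up."""
    return 1 - (1 - single_zone) ** zones


print(round(downtime_budget_minutes(0.9999), 1))
print(combined_availability(0.99, 3))
```

Under these assumptions, three zones that are each only 99% available combine to roughly 99.9999%, which is why multi-AZ deployment is the standard route to a four-nines target.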

Section 10

Amazon's Caching Strategy — The Speed Multiplier

Every millisecond of latency on Amazon.com costs real money. A widely cited internal finding was that every 100 ms of additional latency cost about 1% in sales. With hundreds of billions of dollars in annual revenue, that puts billions of dollars a year at stake for each tenth of a second. Caching is not optional; it is survival.

Browser Cache — 0 ms

Static assets (CSS, JavaScript, fonts, images) are served with aggressive Cache-Control: max-age=31536000 headers. Once a user has visited Amazon.com, their browser caches these assets for up to a year. Second visit loads the page with zero requests to Amazon's servers for static content.

CloudFront Edge Cache — < 5 ms

Product images and pages are cached at the nearest CloudFront PoP. A user in Mumbai hits a CloudFront node in Mumbai — not servers in us-east-1. Cache TTL varies: product listings (5 minutes), product images (7 days), JavaScript bundles (1 year with cache-busting hashes).

ElastiCache (Redis) — < 1 ms

Inside the data centre, frequently accessed data (session tokens, product details, cart contents, recommendation lists) lives in Redis. A Redis GET typically takes well under 1 ms, roughly an order of magnitude faster than a single-digit-millisecond DynamoDB read and faster still than a typical disk-backed relational query. Amazon uses Redis clusters with read replicas per AZ to survive zone failures.

In-Process Memory Cache — < 0.01 ms

Hot configuration data (feature flags, A/B test assignments, service discovery endpoints) is cached in the application process's own memory. No network round-trip at all. Refreshed every 30–60 seconds from a central config store. The fastest cache is the one you never leave the CPU for.
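The in-process tier amounts to a small cache-aside wrapper with a time-to-live. The sketch below is one possible shape, with a hypothetical `TTLCache` class and a `loader` callback standing in for the slower tier (Redis, DynamoDB, or a config store).

```python
import time


class TTLCache:
    """Cache-aside helper: serve from memory until an entry is older than the TTL."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._store = {}

    def get_or_load(self, key, loader):
        now = self._clock()
        entry = self._store.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]               # hit: no network round-trip
        value = loader(key)               # miss or expired: ask the slower tier
        self._store[key] = (now, value)
        return value


cache = TTLCache(ttl_seconds=30)
flags = cache.get_or_load("feature-flags", lambda key: {"new_checkout": True})
print(flags)
```

The injectable `clock` is there so expiry can be tested without sleeping; in production the default monotonic clock is enough.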

Section 11

Auto Scaling — Surviving Prime Day

📖 Story — The Day Everything Had to Work

Prime Day 2023: 375 Million Items Sold in 48 Hours

Prime Day is Amazon's self-created Black Friday: a 48-hour event where traffic spikes to 5–10× normal volume within minutes of the event starting. In 2023, Amazon reported selling 375 million items, Adobe Analytics estimated total US online spending during the event at $12.7 billion, and orders were fulfilled from a global network of 200+ fulfilment centres.

No human being added a single server during Prime Day. AWS Auto Scaling Groups monitored CPU utilisation, request count, queue depth, and network I/O in real-time — and automatically provisioned and deprovisioned capacity to match demand. The system went from "Tuesday afternoon quiet" to "planet-scale commercial event" and back — entirely automatically.

Auto Scaling Signal	Threshold	Action	Response Time
CPU Utilisation	> 70% for 3 min	Add 2 instances	< 90 seconds
SQS Queue Depth	> 1,000 messages	Scale consumers 2×	< 60 seconds
ALB Request Rate	> 10,000 req/s	Pre-warm new AZ	< 3 minutes
Memory Pressure	> 85% RAM	Evict cache + scale	< 30 seconds
Scheduled Scaling	Prime Day T−2h	Pre-scale to 3× base	Instant (planned)
CPU Utilisation	< 30% for 10 min	Terminate excess instances	< 5 minutes

🎯

Predictive vs Reactive Scaling

Amazon uses both: Reactive scaling responds to real-time metrics as they cross thresholds. Predictive scaling uses ML to forecast traffic patterns based on historical data (same day last year, same event type, same time of day) and pre-provisions capacity before the demand spike arrives — because reactive scaling has a 90-second lag that can cause brief outages during instantaneous traffic spikes.
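The reactive half can be sketched as a pure function from metrics to a target instance count, using the CPU and queue thresholds from the table above. The step sizes, the one-instance scale-in, and the minimum and maximum bounds are assumptions for illustration; real policies are configured in the scaling service rather than hand-coded.

```python
def desired_capacity(current, cpu_percent, queue_depth, min_size=2, max_size=100):
    """Return the target instance count for one evaluation of the scaling rules."""
    if cpu_percent > 70:
        target = current + 2          # scale out on sustained high CPU
    elif queue_depth > 1000:
        target = current * 2          # double consumers when the queue backs up
    elif cpu_percent < 30:
        target = current - 1          # scale in slowly when the fleet is idle
    else:
        target = current
    return max(min_size, min(max_size, target))


print(desired_capacity(current=4, cpu_percent=85, queue_depth=0))
```

Clamping to `min_size` and `max_size` keeps a noisy metric from scaling the fleet to zero or without bound.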

Section 12

Amazon's Security Architecture — Defence in Depth

Amazon operates under a Shared Responsibility Model: AWS secures the infrastructure, customers secure what they put on it. For Amazon's own retail application, every layer of the architecture has its own security controls — no single layer is trusted completely.

🔐

IAM — Identity & Access Management

Zero Trust Principals

Every AWS API call is authenticated and authorised via IAM. Lambda functions, ECS tasks, and EC2 instances assume IAM Roles with the minimum permissions needed. A compromised Cart service cannot read the Payments database — their IAM Roles are completely separate.

🛡️

AWS Shield Advanced

DDoS Protection

AWS Shield Advanced provides automatic DDoS mitigation at the network layer (L3/L4) and application layer (L7). Amazon's retail site absorbs multi-hundred-Gbps DDoS attacks routinely — the attack traffic is absorbed at the edge before it reaches any origin server.

🔑

KMS — Key Management Service

Encryption Everywhere

Every S3 bucket, DynamoDB table, RDS database, EBS volume, and SQS queue is encrypted at rest using KMS-managed keys. Every service-to-service call travels over TLS 1.3. Customer payment data is additionally encrypted with PCI-DSS compliant tokenisation before touching any database.
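Least privilege is easiest to see in a policy document. The sketch below builds an IAM policy that lets a role touch exactly one DynamoDB table; the account ID, table name, and chosen actions are hypothetical, and the helper function is not part of any AWS SDK.

```python
import json


def table_policy(region, account_id, table, actions):
    """Build an IAM policy document scoped to a single DynamoDB table."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": sorted(actions),
            "Resource": f"arn:aws:dynamodb:{region}:{account_id}:table/{table}",
        }],
    }


# The Cart service's role can read and write the Cart table and nothing else.
cart_policy = table_policy("us-east-1", "123456789012", "Cart",
                           {"dynamodb:GetItem", "dynamodb:PutItem"})
print(json.dumps(cart_policy, indent=2))
```

Because nothing in this policy names the Payments table, a compromised Cart task has no path to that data through IAM.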

Section 13

Observability — Knowing What Is Happening at All Times

With thousands of microservices running simultaneously, "the site feels slow" is not a useful incident report. Amazon's engineering culture demands observable, measurable, attributable behaviour from every system.

📊 Amazon's Three Pillars of Observability

Metrics

Every service emits standardised CloudWatch metrics: request count, error rate, latency (p50/p95/p99), queue depth, cache hit rate. Dashboard alarms fire when any metric crosses a threshold. Prime Day dashboards are reviewed by 100+ engineers watching real-time panels during the event.

Logs

Every request is logged with a unique correlation ID (X-Amzn-Trace-Id) that flows through every service the request touches. CloudWatch Logs Insights lets engineers run SQL-like queries across petabytes of log data in seconds: "Show me all orders that touched the Payments service and had latency > 2s in the last 5 minutes."

Traces

AWS X-Ray provides distributed tracing: a visual flame chart of exactly which service, which function, and which database query added how much latency to a specific request. Finding the slow node in a chain of 12 microservices takes seconds, not hours.

Alarms

CloudWatch Alarms trigger PagerDuty → on-call engineer when error rates exceed SLOs. Amazon's target: every P1 incident is acknowledged within 5 minutes and has a workaround deployed within 30 minutes, 24/7/365.
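Correlation IDs only work if every service writes them in a machine-readable form. A minimal sketch, assuming JSON log lines and a random UUID as the ID (the real X-Amzn-Trace-Id header has its own format); the function and field names are hypothetical.

```python
import json
import uuid


def new_correlation_id():
    """Generate an ID at the edge; every downstream service passes it along."""
    return uuid.uuid4().hex


def log_line(service, message, correlation_id, **fields):
    """Render one structured log record as a JSON string."""
    record = {"service": service, "correlation_id": correlation_id,
              "message": message, **fields}
    return json.dumps(record, sort_keys=True)


cid = new_correlation_id()
print(log_line("cart", "item added", cid, latency_ms=12))
print(log_line("payments", "charge authorised", cid, latency_ms=240))
```

Filtering logs on a single `correlation_id` value then reconstructs one request's path across every service it touched.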

Section 14

"Customers Who Bought This…" — The Recommendation Architecture

📖 Story — 35% of Amazon's Revenue

The Algorithm That Makes Amazon Billions

Amazon's recommendation engine (the "Customers who bought this also bought…" and "You might also like…" features) is credited with an estimated 35% of what customers buy on Amazon, a figure widely attributed to McKinsey. This is not a side feature. It is the central nervous system of Amazon's retail business.

The original algorithm, item-to-item collaborative filtering, was developed at Amazon in the late 1990s (the patent was filed in 1998 and granted in 2001) and described publicly by Greg Linden, Brent Smith, and Jeremy York in 2003. Instead of finding similar users (computationally expensive at Amazon scale), it pre-computes item-to-item similarity scores offline and serves them in real-time from a fast lookup table. The insight: the similarity matrix is computed once, stored in a fast data store, and looked up in milliseconds during page render.
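The offline half of that idea fits in a short function. The sketch below computes cosine similarity between items from purchase baskets and serves the top matches from the resulting table; it is a textbook simplification on toy data, not Amazon's production algorithm.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def item_similarities(baskets):
    """Cosine similarity between items, based on which customers bought both."""
    item_count = defaultdict(int)
    pair_count = defaultdict(int)
    for basket in baskets:
        items = sorted(set(basket))
        for item in items:
            item_count[item] += 1
        for a, b in combinations(items, 2):
            pair_count[(a, b)] += 1
    sims = defaultdict(dict)
    for (a, b), together in pair_count.items():
        score = together / sqrt(item_count[a] * item_count[b])
        sims[a][b] = score
        sims[b][a] = score
    return sims


def recommend(sims, item, top_n=3):
    """Online step: a lookup and a sort over precomputed scores."""
    ranked = sorted(sims.get(item, {}).items(), key=lambda kv: (-kv[1], kv[0]))
    return [other for other, _ in ranked[:top_n]]


baskets = [["book", "lamp"], ["book", "lamp", "pen"], ["book", "pen"], ["lamp"]]
sims = item_similarities(baskets)
print(recommend(sims, "book"))
```

All the expensive counting happens in `item_similarities`, which can run as a batch job; `recommend` does no model computation at request time.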

🔄

Offline Training

Batch (every few hours)

Amazon SageMaker trains recommendation models on a rolling window of purchase and browsing history. Item-to-item similarity matrices are computed using collaborative filtering, content-based features, and deep learning embeddings. Training jobs run on spot instances for cost efficiency.

⚡

Online Serving

Real-time (< 30 ms)

The trained model is deployed behind a SageMaker Endpoint. When a product page loads, a feature vector (item ID + user context) is sent to the endpoint, which returns the top-N recommendations. The entire inference pipeline — feature lookup, model call, result ranking — takes < 30 ms.

🧪

A/B Testing

Always Experimenting

Amazon runs thousands of simultaneous A/B tests. Each user is assigned to experiment buckets via an experiment assignment service. Algorithm variant A vs B vs C are measured on click-through rate, add-to-cart rate, and final purchase conversion — not just impressions. The winner is deployed automatically when statistical significance is reached.

Section 15

The Physical Architecture — Fulfilment Centres & Last-Mile Delivery

Amazon's architecture is not just software. The physical fulfilment network is one of the most sophisticated logistics systems ever built — and its design mirrors the same principles as the software architecture: distributed, redundant, independently scalable, and event-driven.

Facility Type	Size	Purpose	Count (approx.)
Fulfilment Centre (FC)	500,000–1M sq ft	Receive, store, pick, pack, ship large items	200+ worldwide
Sortation Centre	200,000–400,000 sq ft	Sort packages by delivery route/carrier	50+
Delivery Station	50,000–100,000 sq ft	Last-mile: packages sorted for individual drivers	1,000+ (US alone)
Prime Now Hub	10,000–30,000 sq ft	Ultra-fast 1–2 hour delivery of top-SKU items	30+ cities
Amazon Air Hub	Custom airport facility	Dedicated air cargo for same-day/next-day delivery	2 primary (CVG, AFW)

🤖

Amazon Robotics — Kiva to Proteus

In 2012, Amazon acquired Kiva Systems (robot manufacturer) for $775 million. Today Amazon reports more than 750,000 robots across its operations, many of them descendants of the original Kiva drive units. Instead of workers walking kilometres of aisles to pick items, robots bring the shelving pods to the workers. Reported reductions in the time from order click to shipment are large; a commonly cited estimate is from over an hour to about 15 minutes. The newest generation, Proteus, is Amazon's first fully autonomous mobile robot, operating freely alongside human workers without safety cages.

Section 16

Key Architectural Patterns — The Reusable Blueprints

🔁

CQRS

Command Query Responsibility Segregation

Amazon separates write operations (commands — "place order") from read operations (queries — "get order list"). Writes go to a strongly-consistent DynamoDB table. Reads are served from a denormalised, pre-aggregated read model in ElastiCache. Different data stores, different scaling characteristics, same business data.

📋

Event Sourcing

Immutable Event Log

Order state is never updated in-place. Instead, every state change is appended as an immutable event: "OrderCreated", "PaymentReceived", "ShipmentDispatched", "DeliveryConfirmed". The current state is derived by replaying events. Enables full audit trail, temporal queries, and event replay for recovery.
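Deriving state by replay takes only a few lines. The sketch below uses the event names from the text; the state labels and the `replay` helper are hypothetical.

```python
# Map each event type to the order state it produces (labels are illustrative).
TRANSITIONS = {
    "OrderCreated": "CREATED",
    "PaymentReceived": "PAID",
    "ShipmentDispatched": "SHIPPED",
    "DeliveryConfirmed": "DELIVERED",
}


def replay(events):
    """Fold the append-only event log into the current order state."""
    state = None
    for event in events:
        state = TRANSITIONS[event["type"]]
    return state


log = [
    {"type": "OrderCreated"},
    {"type": "PaymentReceived"},
    {"type": "ShipmentDispatched"},
]
print(replay(log))       # current state
print(replay(log[:2]))   # temporal query: state after the first two events
```

Replaying a prefix of the log answers "what was the state at that point?", which is the temporal-query benefit in practice.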

🔒

Saga Pattern

Distributed Transactions

A checkout saga coordinates a distributed transaction across 5 services: Reserve Inventory → Charge Payment → Confirm Reservation → Create Shipment → Send Confirmation. Each step publishes a success or failure event. If any step fails, compensating transactions undo prior steps. No two-phase commit, no distributed locks.
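The compensation logic is the heart of a saga. A minimal orchestrated sketch, assuming each step is given as a (name, action, compensation) tuple; the step names and the in-process calls are hypothetical, and a real saga would exchange events between services.

```python
def run_saga(steps):
    """Run steps in order; on failure, undo completed steps in reverse order.

    Returns (True, None) on success or (False, failed_step_name) on failure.
    """
    completed = []
    for name, action, compensate in steps:
        try:
            action()
        except Exception:
            for _, undo in reversed(completed):
                undo()
            return False, name
        completed.append((name, compensate))
    return True, None


audit = []


def charge_payment():
    raise RuntimeError("card declined")


result = run_saga([
    ("reserve_inventory", lambda: audit.append("reserved"), lambda: audit.append("released")),
    ("charge_payment", charge_payment, lambda: audit.append("refunded")),
])
print(result, audit)
```

Only steps that actually completed are compensated, so the failed payment step is never "refunded".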

🔄

Circuit Breaker

Failure Isolation

When the Recommendations service starts returning errors or timing out, the Circuit Breaker trips open: subsequent calls immediately return a fallback (empty recommendations) without waiting. The slow service cannot cascade its failure downstream. After 30 seconds, the breaker tries a probe request — if it succeeds, closes again.
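A circuit breaker is a small state machine. The sketch below is one common shape rather than a specific library's API: the failure threshold, the 30-second reset window, and the injectable clock are illustrative parameters.

```python
import time


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the breaker is closed

    def call(self, func, fallback):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after:
                return fallback()  # open: fail fast without calling the service
            # Reset window elapsed: fall through and let one probe request in.
        try:
            result = func()
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open, or re-open after a failed probe
            return fallback()
        self._failures = 0
        self._opened_at = None  # success closes the breaker
        return result


breaker = CircuitBreaker()
print(breaker.call(lambda: ["item-1", "item-2"], fallback=list))
```

While the breaker is open, callers get the fallback immediately instead of waiting on a timeout, which is what stops a slow dependency from tying up the caller's threads.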

🎯

Strangler Fig

Incremental Migration

How Amazon migrated from monolith to microservices without a big-bang rewrite: they built new services beside the monolith, then routed specific URL paths to the new service. Gradually, the monolith was "strangled" — one route at a time — until it was gone. No downtime. No rewrites. Just progressive replacement.

📫

Outbox Pattern

Guaranteed Event Delivery

When an Order service writes to its database, it also writes an event record to an "outbox" table in the same database transaction. A separate poller reads the outbox and publishes to SNS/SQS. Even if the app crashes after the DB write but before the publish, the event will still be delivered. Delivery is at-least-once, not exactly-once: the poller may publish the same event twice after a crash, so consumers must be idempotent.
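The pattern can be sketched with SQLite standing in for the service's database. The table layout, the function names, and the `publish` callback (a stand-in for an SNS/SQS client) are all hypothetical. Note that the relay can publish an event more than once if it crashes between publishing and marking the row, so consumers should tolerate duplicates.

```python
import sqlite3


def place_order(conn, order_id):
    with conn:  # one transaction: both rows commit or neither does
        conn.execute("INSERT INTO orders (id) VALUES (?)", (order_id,))
        conn.execute(
            "INSERT INTO outbox (order_id, type, published) VALUES (?, 'OrderPlaced', 0)",
            (order_id,),
        )


def relay(conn, publish):
    """Publish unpublished outbox rows in order; return how many were sent."""
    rows = conn.execute(
        "SELECT rowid, order_id, type FROM outbox WHERE published = 0 ORDER BY rowid"
    ).fetchall()
    for rowid, order_id, event_type in rows:
        publish({"orderId": order_id, "type": event_type})
        with conn:  # mark as published only after the publish call returned
            conn.execute("UPDATE outbox SET published = 1 WHERE rowid = ?", (rowid,))
    return len(rows)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id TEXT PRIMARY KEY)")
conn.execute("CREATE TABLE outbox (order_id TEXT, type TEXT, published INTEGER)")

sent = []
place_order(conn, "A18-293847")
print(relay(conn, sent.append), sent)
```

Writing the order and the outbox row in one transaction is the whole trick: there is never an order without its pending event, or an event without its order.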

Section 17

Benefits, Use Cases & Trade-offs

Architectural Decision	Benefit	Trade-off	When to Apply
Microservices	Independent deploy, scale, fault isolation	Distributed system complexity, network overhead	Teams > 50 engineers, > 5 independent domains
DynamoDB (NoSQL)	Unlimited scale, <5 ms latency, zero ops	No ad-hoc queries, schema must be pre-planned	Session, cart, catalogue, IoT, gaming leaderboards
Event-Driven (SQS/SNS)	Decoupling, async, resilience, fan-out	Eventual consistency, harder to debug	Any workflow with 3+ downstream side effects
Lambda (Serverless)	Zero infrastructure, auto-scale, pay-per-use	Cold starts, 15-min timeout, stateless only	Event processing, APIs, scheduled jobs, webhooks
Multi-AZ Deployment	99.99% availability, zero-downtime failover	2–3× cost, cross-AZ data transfer fees	All production services with SLA commitments
ElastiCache (Redis)	Sub-ms reads, much faster than a database round-trip	Cache invalidation complexity, extra cost	Session tokens, hot product data, leaderboards
CloudFront CDN	Global low latency, DDoS absorption	Cache invalidation lag (default TTL 24 h)	Static assets, product images, public API responses
Saga Pattern	Distributed transactions without locking	Complex compensating logic, eventual consistency	Multi-service business transactions (checkout, booking)

Section 18

Golden Rules — Amazon's Architecture Principles

🏆 Amazon Architecture — Non-Negotiable Principles

Design for failure, not uptime. Assume every component — every server, every AZ, every third-party API — will fail. Build so that when (not if) it fails, the rest of the system degrades gracefully and continues serving customers, even at reduced capability.

API-first, always. No service may access another service's data store directly. All cross-service communication happens exclusively through versioned API contracts. This is the single most important rule for long-term architectural health.

Decouple with queues, not direct calls. Whenever a service produces a side effect that another service must handle, publish an event to a queue. Never make a synchronous call to a service whose failure or slowness you cannot tolerate.

Measure everything, trust nothing. Every service must emit metrics, logs, and traces. If you cannot measure a behaviour, you cannot own it. On-call engineers must be able to diagnose any incident purely from observability data — without asking anyone else.

The team that builds it runs it. No separate "operations" team. The engineering team that built the service owns its on-call, its SLAs, and its incident response. This alignment makes engineers care deeply about operational quality — because they are woken up at 3 a.m. when it breaks.

Prefer managed services over self-managed infrastructure. Running your own Kafka cluster, your own Kubernetes control plane, or your own MySQL primary-replica setup is undifferentiated heavy lifting. Use SQS, EKS, and RDS instead. Spend engineering time on features, not infrastructure management.

Always be experimenting. No architectural decision is permanent. Every pattern, every data store choice, every service boundary should be re-evaluated as the system grows. The strangler fig pattern exists precisely because perfect decisions made at 10 engineers become wrong decisions at 1,000 engineers.