
Matplotlib in Python

A hands-on guide to Python's most powerful plotting library — covering line charts, bar charts, scatter plots, histograms, heatmaps, subplots, violin plots, and publication-quality figure styling, with every chart rendered as a live output diagram.

Section 01

Introduction to Matplotlib

Matplotlib is Python's foundational plotting library — the engine under almost every visualisation in the scientific and data science ecosystem. Seaborn, pandas .plot(), and dozens of other libraries are all built on top of it. Understanding matplotlib directly gives you complete control over every pixel of every chart you produce.

💡
When to Use Matplotlib Directly

Use seaborn or plotly for quick EDA. Use matplotlib directly when you need pixel-precise control: custom tick formatters, dual axes, subplots with shared axes, inset charts, publication-quality figures with exact font sizes, or any layout that a higher-level library cannot produce. Matplotlib's learning curve pays off in unlimited flexibility.

The Anatomy of a Matplotlib Figure

🗺️ Matplotlib Figure Anatomy
[Figure: anatomy of a matplotlib figure, showing the Figure (fig = plt.figure()), the Axes (ax = fig.add_subplot()), the chart title (ax.set_title()), the x- and y-axis labels (ax.set_xlabel(), ax.set_ylabel()), ticks, spines (ax.spines), grid, and legend (ax.legend()).]

Every visual element in matplotlib is an object you can access and modify. The Figure is the canvas; the Axes is the actual plot area. One Figure can contain many Axes.
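The Figure/Axes relationship described above can be sketched in a few lines. This is a minimal illustration with made-up data; the variable names ax_left and ax_right are assumptions chosen for the example.

```python
import matplotlib.pyplot as plt

# One Figure (the canvas) holding two Axes (the plot areas)
fig, (ax_left, ax_right) = plt.subplots(1, 2, figsize=(8, 3))

ax_left.plot([1, 2, 3], [2, 4, 3], label='Series A')
ax_left.set_title('Left panel')
ax_left.set_xlabel('x')
ax_left.set_ylabel('y')
ax_left.legend()

# Every element is an object reachable from its parent
ax_right.spines['top'].set_visible(False)    # a single spine
ax_right.xaxis.label.set_color('gray')       # the x-axis label object
ax_right.set_title('Right panel')

print(len(fig.axes))          # how many Axes this Figure contains
print(ax_left.figure is fig)  # each Axes knows its parent Figure
```

Call plt.show() or fig.savefig() afterwards to view the result.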

Setup and Imports

import matplotlib.pyplot as plt
import matplotlib.ticker as mticker
import numpy as np
import pandas as pd

# Apply a style sheet globally
plt.style.use('seaborn-v0_8-darkgrid')   # clean dark grid look

# Set default figure size and resolution
plt.rcParams['figure.figsize']    = (10, 5)
plt.rcParams['figure.dpi']        = 120
plt.rcParams['axes.titlesize']    = 14
plt.rcParams['axes.labelsize']    = 12
plt.rcParams['xtick.labelsize']   = 10
plt.rcParams['ytick.labelsize']   = 10
plt.rcParams['legend.fontsize']   = 10
plt.rcParams['lines.linewidth']   = 2
📐
OO Interface vs pyplot Interface

Matplotlib has two interfaces. The pyplot interface (plt.plot()) is quick for single plots. The object-oriented interface (fig, ax = plt.subplots()) is the professional standard — always use it for anything more than a one-liner. It gives you explicit control over every axes object and avoids confusing state bugs.
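To make the difference concrete, here is a small sketch of both interfaces side by side, using made-up data. It also shows the kind of hidden state the pyplot interface relies on: pyplot always draws on a "current" axes, which changes whenever new axes are created.

```python
import matplotlib.pyplot as plt

x = [0, 1, 2, 3]
y = [0, 1, 4, 9]

# pyplot interface: each call acts on the implicit "current" axes
plt.figure()
plt.plot(x, y)
plt.title('pyplot interface')
implicit_ax = plt.gca()      # the axes pyplot was drawing on

# Object-oriented interface: every call names its target explicitly
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.plot(x, y)
ax1.set_title('OO: first axes')
ax2.plot(x, y[::-1])
ax2.set_title('OO: second axes')

# plt.subplots() has moved the "current" axes to the last one it created,
# so a later plt.title(...) would now land on ax2, not on the first figure
print(plt.gca() is ax2)
```

With the object-oriented calls there is never any doubt about which axes receives a title or a line, which is why it scales better to multi-panel figures.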


Section 02

Line Chart — Trends Over Time

The line chart is matplotlib's most used plot. It connects data points in sequence — ideal for time series, continuous functions, and any data where order matters. Every visual property of the line (colour, width, style, markers) is independently controllable.

fig, ax = plt.subplots(figsize=(10, 5))

# Multiple lines with different styles
ax.plot(months, revenue,
       color='#60a5fa', linewidth=2, marker='o', markersize=5,
       label='Revenue')

ax.plot(months, target,
       color='#f59e0b', linewidth=1.5, linestyle='--',
       label='Target')

# Fill the area between lines
ax.fill_between(months, revenue, target,
               where=[r > t for r, t in zip(revenue, target)],
               alpha=0.15, color='#34d399', label='Above target')

ax.set_title('Monthly Revenue vs Target', fontsize=14, fontweight='bold', pad=12)
ax.set_xlabel('Month')
ax.set_ylabel('Revenue (₹ thousands)')
ax.legend(framealpha=0.3)
ax.yaxis.set_major_formatter(mticker.FuncFormatter(lambda x, _: f'₹{x:.0f}k'))
plt.tight_layout()
plt.show()
📊 Output: Line chart, Monthly Revenue vs Target (legend: Revenue, Target (dashed), Above target)

The green shaded area appears only where Revenue exceeds Target — created with ax.fill_between(where=[...]). Dashed line = target series using linestyle='--'.
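The listing above assumes that months, revenue, and target already exist. A self-contained sketch with hypothetical values (invented purely for illustration) might look like this. It uses a NumPy boolean mask for where= and adds interpolate=True, which is an optional refinement not used in the original listing.

```python
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical data for illustration only (values in ₹ thousands)
months = np.arange(1, 13)
revenue = np.array([42, 45, 51, 48, 55, 61, 58, 66, 63, 70, 74, 81])
target = np.array([45, 47, 49, 51, 53, 55, 57, 59, 61, 63, 65, 67])

fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(months, revenue, marker='o', label='Revenue')
ax.plot(months, target, linestyle='--', label='Target')

# Shade only where revenue is above target; interpolate=True extends the
# fill to the exact crossing points instead of stopping at the nearest sample
above = revenue > target
ax.fill_between(months, revenue, target, where=above,
                interpolate=True, alpha=0.15, color='green',
                label='Above target')

ax.set_xticks(months)
ax.set_xlabel('Month')
ax.set_ylabel('Revenue (₹ thousands)')
ax.legend()
fig.tight_layout()
```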

🧮 Key Line Chart Parameters
color: any hex code, named colour, or RGB tuple, e.g. color='#60a5fa', color='steelblue', or color=(0.2, 0.6, 1.0)
linestyle: '-' (solid), '--' (dashed), '-.' (dash-dot), ':' (dotted); custom dash pattern: linestyle=(0, (5, 2, 1, 2))
marker: 'o' circle, 's' square, '^' triangle, 'D' diamond, '+' plus, 'x' cross
fill_between: ax.fill_between(x, y1, y2, alpha=0.2, color='green') fills the area between two lines
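As a quick visual reference for the parameters listed above, the following sketch draws one sine curve per line style, each with a different marker. The data and the vertical offsets are arbitrary choices for the example.

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 50)
styles = ['-', '--', '-.', ':', (0, (5, 2, 1, 2))]   # last entry: custom dash tuple
markers = ['o', 's', '^', 'D', 'x']

fig, ax = plt.subplots(figsize=(8, 4))
for offset, (ls, mk) in enumerate(zip(styles, markers)):
    # markevery=5 draws a marker on every fifth point to keep the lines readable
    ax.plot(x, np.sin(x) + offset, linestyle=ls, marker=mk,
            markevery=5, label=f'linestyle={ls!r}, marker={mk!r}')
ax.legend(fontsize=8, loc='upper right')
fig.tight_layout()
```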

Section 03

Bar Chart — Comparing Categories

Bar charts compare a numeric measure across discrete categories. Matplotlib supports vertical bars, horizontal bars, grouped bars, and stacked bars — each with independent colour and width control per bar.

fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# ── Grouped bar chart ──────────────────────────────
x = np.arange(len(categories))
width = 0.35

bars1 = axes[0].bar(x - width/2, male_avg,   width, label='Male',   color='#60a5fa', alpha=0.85)
bars2 = axes[0].bar(x + width/2, female_avg, width, label='Female', color='#f87171', alpha=0.85)

# Add value labels on top of each bar
axes[0].bar_label(bars1, fmt='₹%.0fk', padding=3, fontsize=9)
axes[0].bar_label(bars2, fmt='₹%.0fk', padding=3, fontsize=9)

axes[0].set_xticks(x)
axes[0].set_xticklabels(categories, rotation=30, ha='right')
axes[0].set_title('Avg Spend by Category & Gender')
axes[0].legend()

# ── Horizontal bar chart (sorted) ──────────────────
sorted_idx = np.argsort(total_revenue)
axes[1].barh(np.array(categories)[sorted_idx],
             np.array(total_revenue)[sorted_idx],
             color='#f59e0b', alpha=0.85)
axes[1].set_title('Total Revenue by Category')
plt.tight_layout()
plt.show()
📊 Output: Grouped bar (left, legend: Male, Female) & Horizontal bar sorted (right, total revenue by category)

Left: ax.bar_label() adds value annotations automatically. Right: ax.barh() with np.argsort() creates a sorted horizontal ranking chart.
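For readers who want to run the bar chart example end to end, here is a compact sketch with hypothetical category names and values (all invented for illustration). It keeps the same two techniques: offsetting grouped bars by half a bar width, and ranking with np.argsort().

```python
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical data for illustration only (values in ₹ thousands)
categories = ['Electronics', 'Fashion', 'Grocery', 'Books']
male_avg = [12.4, 6.1, 3.8, 1.9]
female_avg = [10.2, 9.7, 4.1, 2.6]
total_revenue = [540, 410, 220, 95]

fig, (ax_grouped, ax_ranked) = plt.subplots(1, 2, figsize=(12, 5))

# Grouped bars: shift each series half a bar width either side of the tick
x = np.arange(len(categories))
width = 0.35
bars_m = ax_grouped.bar(x - width / 2, male_avg, width, label='Male')
bars_f = ax_grouped.bar(x + width / 2, female_avg, width, label='Female')
labels_m = ax_grouped.bar_label(bars_m, fmt='%.1f', padding=3)
ax_grouped.bar_label(bars_f, fmt='%.1f', padding=3)
ax_grouped.set_xticks(x)
ax_grouped.set_xticklabels(categories)
ax_grouped.set_ylabel('Average spend (₹ thousands)')
ax_grouped.legend()

# Ranked horizontal bars: argsort gives ascending order, and barh draws
# the first item at the bottom, so the largest category ends up on top
order = np.argsort(total_revenue)
ax_ranked.barh(np.array(categories)[order], np.array(total_revenue)[order])
ax_ranked.set_xlabel('Total revenue (₹ thousands)')
fig.tight_layout()
```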


Section 04

Scatter Plot — Relationships Between Variables

Scatter plots reveal correlations, clusters, and outliers. In matplotlib, the scatter() function encodes up to four dimensions simultaneously: x position, y position, colour (third variable), and size (fourth variable).

fig, ax = plt.subplots(figsize=(9, 6))

scatter = ax.scatter(
    df['age'], df['purchase_amount'],
    c=df['rating'],          # colour encodes a 3rd variable
    s=df['delivery_days'] * 8, # size encodes a 4th variable
    cmap='viridis',
    alpha=0.6, edgecolors='white', linewidths=0.4
)

# Colourbar for the 3rd dimension
cbar = plt.colorbar(scatter, ax=ax)
cbar.set_label('Customer Rating', fontsize=11)

# Annotate a specific outlier point
ax.annotate('High-value outlier',
            xy=(58, 24800), xytext=(45, 23000),
            arrowprops=dict(arrowstyle='->', color='#f87171'),
            fontsize=9, color='#f87171')

ax.set_xlabel('Customer Age')
ax.set_ylabel('Purchase Amount (₹)')
ax.set_title('Age vs Purchase — colour=rating, size=delivery days')
plt.tight_layout()
plt.show()
📊 Output: 4-variable scatter — x=age, y=purchase, colour=rating, size=delivery days

Four variables encoded in one chart. The ax.annotate() call adds an arrow pointing to the outlier. Colourbar created with plt.colorbar(scatter, ax=ax).
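The scatter listing relies on a DataFrame named df that is not defined in the text. The sketch below builds a synthetic stand-in with the same column names (the distributions are arbitrary assumptions) so the four-variable encoding can be tried directly. It uses fig.colorbar(), the object-oriented counterpart of plt.colorbar().

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic data for illustration only; column names mirror the listing above
rng = np.random.default_rng(0)
n_customers = 200
df = pd.DataFrame({
    'age': rng.integers(18, 70, n_customers),
    'purchase_amount': rng.gamma(2.0, 2500.0, n_customers),
    'rating': rng.uniform(1, 5, n_customers),
    'delivery_days': rng.integers(1, 10, n_customers),
})

fig, ax = plt.subplots(figsize=(9, 6))
points = ax.scatter(df['age'], df['purchase_amount'],
                    c=df['rating'],              # colour: 3rd variable
                    s=df['delivery_days'] * 8,   # marker area: 4th variable
                    cmap='viridis', alpha=0.6,
                    edgecolors='white', linewidths=0.4)

cbar = fig.colorbar(points, ax=ax)
cbar.set_label('Customer Rating')

ax.set_xlabel('Customer Age')
ax.set_ylabel('Purchase Amount (₹)')
fig.tight_layout()
```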

🧮 ax.annotate() — Pointing to Important Points
Basic: ax.annotate('text', xy=(x_point, y_point), xytext=(x_label, y_label))
Arrow styles: arrowprops=dict(arrowstyle='->', color='red', lw=1.5); other styles include '-|>', 'fancy', and 'simple'
Box around text: bbox=dict(boxstyle='round,pad=0.3', facecolor='#f59e0b', alpha=0.3)
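The three annotate() options above combine naturally. As a sketch on a made-up series, the example below finds the point to annotate with np.argmax() rather than hard-coding its coordinates; the label offsets are arbitrary.

```python
import matplotlib.pyplot as plt
import numpy as np

# Synthetic series for illustration only
x = np.arange(10)
y = np.array([3, 4, 4, 5, 12, 5, 6, 6, 7, 7])

fig, ax = plt.subplots(figsize=(7, 4))
ax.plot(x, y, marker='o')

# Locate the point to annotate programmatically instead of hard-coding it
peak = int(np.argmax(y))
note = ax.annotate(
    'Peak',
    xy=(int(x[peak]), int(y[peak])),                # the point being annotated
    xytext=(int(x[peak]) + 2, int(y[peak]) - 1),    # where the label text sits
    arrowprops=dict(arrowstyle='->', color='red', lw=1.5),
    bbox=dict(boxstyle='round,pad=0.3', facecolor='#f59e0b', alpha=0.3),
)
fig.tight_layout()
```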

Section 05

Histogram & KDE — Visualising Distributions

Matplotlib's ax.hist() is the foundation for distribution analysis. Combined with a manually computed KDE line, it gives you the full picture of a variable's shape — skewness, modality, and tail weight — all in one chart.

from scipy.stats import gaussian_kde

fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# ── Left: histogram + KDE ──────────────────────────
n, bins, patches = axes[0].hist(
    data, bins=35, color='#60a5fa', alpha=0.6,
    edgecolor='white', linewidth=0.4, density=True
)

# Overlay KDE curve
kde = gaussian_kde(data)
x_range = np.linspace(data.min(), data.max(), 300)
axes[0].plot(x_range, kde(x_range), color='#f59e0b', linewidth=2, label='KDE')

# Colour bars by region (below mean = blue, above = amber)
mean_val = data.mean()
for patch, left_edge in zip(patches, bins):
    if left_edge > mean_val:
        patch.set_facecolor('#f59e0b')

axes[0].axvline(mean_val, color='#f59e0b', linestyle='--', label=f'Mean={mean_val:.0f}')
axes[0].axvline(np.median(data), color='#34d399', linestyle='--', label='Median')
axes[0].legend()
axes[0].set_title('Distribution of Purchase Amount')

# ── Right: overlapping histograms by group ──────────
for group, colour in zip(groups, ['#60a5fa', '#f87171', '#34d399']):
    axes[1].hist(group['data'], bins=25, alpha=0.5,
                color=colour, label=group['label'], density=True)
axes[1].legend()
axes[1].set_title('Distribution by Income Bracket')
plt.tight_layout()
plt.show()
📊 Output: Histogram + KDE (left, legend: Below mean, Above mean, KDE) & Overlapping group histograms (right, legend: Low, Medium, High)

Left: bars to the right of the mean (amber) show the right-skew. ax.axvline() draws vertical reference lines. Right: three overlapping histograms with alpha=0.5 reveal how income groups differ in spending.
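A runnable version of the left-hand panel, using a synthetic right-skewed sample (a log-normal draw chosen only for illustration), could look like the following. It also makes explicit why density=True matters: it rescales the bars so that their total area is 1, which puts the histogram on the same scale as the KDE curve.

```python
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import gaussian_kde

# Synthetic right-skewed sample for illustration only
rng = np.random.default_rng(42)
data = rng.lognormal(mean=8.0, sigma=0.5, size=1000)

fig, ax = plt.subplots(figsize=(8, 5))

# density=True rescales bar heights so the total bar area equals 1
counts, edges, patches = ax.hist(data, bins=35, density=True,
                                 alpha=0.6, edgecolor='white')

kde = gaussian_kde(data)
grid = np.linspace(data.min(), data.max(), 300)
ax.plot(grid, kde(grid), linewidth=2, label='KDE')

# Recolour bars whose left edge lies above the mean
mean_val = data.mean()
for patch, left_edge in zip(patches, edges[:-1]):
    if left_edge > mean_val:
        patch.set_facecolor('#f59e0b')

ax.axvline(mean_val, linestyle='--', color='black', label=f'Mean={mean_val:.0f}')
ax.axvline(np.median(data), linestyle=':', color='black', label='Median')
ax.legend()
fig.tight_layout()

# Sanity check: the bar areas of a density histogram sum to 1
area = float(np.sum(counts * np.diff(edges)))
```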


Section 06

Subplots — Multiple Charts in One Figure

The plt.subplots() function creates a grid of axes objects. This is the core layout tool in matplotlib — it lets you create dashboards, comparison panels, and multi-panel analysis figures with precise shared axis control.

# 2×2 grid of subplots with independent axes (set sharex=True / sharey=True to share them)
fig, axes = plt.subplots(2, 2, figsize=(12, 8),
                         sharex=False, sharey=False)

# Access individual axes
ax_line  = axes[0, 0]   # top-left
ax_bar   = axes[0, 1]   # top-right
ax_hist  = axes[1, 0]   # bottom-left
ax_scat  = axes[1, 1]   # bottom-right

# Add a shared super-title for the whole figure
fig.suptitle('Sales Dashboard — Q4 2024', fontsize=16, fontweight='bold', y=1.01)

# To hide an unused panel in a larger grid (e.g. 2×3): axes[1, 2].set_visible(False)

# Adjust spacing between subplots
plt.tight_layout(pad=2.0)
plt.show()
📊 Output: 2×2 Subplot Grid, Sales Dashboard Q4 2024 (panels: Revenue Trend (Line), Category Comparison (Bar), Age Distribution (Histogram), Age vs Spend (Scatter))

fig.suptitle() sets a shared title across all panels. Each subplot is an independent Axes object — style each one separately using its own ax variable.
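The grid listing creates the four axes but leaves them empty. As a sketch of how the panels might be filled, the example below uses synthetic data (all values invented) and the panel titles from the dashboard output. It also shows axes.flat, a convenient way to apply a common setting to every panel.

```python
import matplotlib.pyplot as plt
import numpy as np

# Synthetic data for illustration only
rng = np.random.default_rng(1)
ages = rng.normal(38, 10, 300)
spend = ages * 120 + rng.normal(0, 1500, 300)

fig, axes = plt.subplots(2, 2, figsize=(12, 8))   # axes is a 2×2 NumPy array

axes[0, 0].plot(np.arange(1, 13), rng.uniform(40, 80, 12), marker='o')
axes[0, 0].set_title('Revenue Trend (Line)')

axes[0, 1].bar(['A', 'B', 'C'], [5, 8, 3])
axes[0, 1].set_title('Category Comparison (Bar)')

axes[1, 0].hist(ages, bins=20)
axes[1, 0].set_title('Age Distribution (Histogram)')

axes[1, 1].scatter(ages, spend, s=10, alpha=0.5)
axes[1, 1].set_title('Age vs Spend (Scatter)')

# axes.flat iterates over all four panels for settings they have in common
for ax in axes.flat:
    ax.grid(True, alpha=0.3)

fig.suptitle('Sales Dashboard', fontsize=16, fontweight='bold')
fig.tight_layout()
```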

🧮 Subplots Layout Cheatsheet
Basic grid: fig, axes = plt.subplots(nrows, ncols, figsize=(w, h))
Shared axes: plt.subplots(2, 1, sharex=True) makes both panels share the same x-axis limits, including zoom and pan
Unequal sizes: fig, axes = plt.subplot_mosaic([['A','A'],['B','C']]) makes the top panel span the full width; axes is a dict keyed by label
Remove axes: axes[1,2].set_visible(False) hides an unused panel, e.g. the sixth cell of a 2×3 grid that holds five charts
Spacing: plt.tight_layout(pad=2.0) adjusts spacing automatically; plt.subplots_adjust(hspace=0.4) gives manual control
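Three of the cheatsheet entries are easier to grasp when run. The sketch below, on arbitrary sine and cosine data, shows that subplot_mosaic() returns a dict of Axes keyed by label, that sharex=True propagates axis limits between panels, and how an unused cell in a 2×3 grid is hidden.

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 100)

# subplot_mosaic returns a dict of Axes keyed by the labels in the layout
fig, panels = plt.subplot_mosaic([['A', 'A'],
                                  ['B', 'C']], figsize=(8, 5))
panels['A'].plot(x, np.sin(x))
panels['A'].set_title('A spans the full width')
panels['B'].plot(x, np.cos(x))
panels['C'].plot(x, np.sin(2 * x))
fig.tight_layout()

# sharex=True links the x-limits of the stacked panels
fig2, (top, bottom) = plt.subplots(2, 1, sharex=True, figsize=(6, 4))
top.plot(x, np.sin(x))
bottom.plot(x, np.cos(x))
top.set_xlim(0, 3)          # the bottom panel follows automatically

# Hiding the unused sixth panel of a 2×3 grid
fig3, grid = plt.subplots(2, 3, figsize=(9, 5))
grid[1, 2].set_visible(False)
```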

Section 07

Heatmap with imshow() — Correlation & Pivot Tables

Matplotlib's ax.imshow() renders any 2D array as a coloured grid, which makes it the foundation for correlation heatmaps, pivot table visualisations, and confusion matrices. Compared with seaborn's heatmap(), the matplotlib version asks you to write the cell annotations and tick labels yourself, and in return every one of those details is under your direct control.

fig, ax = plt.subplots(figsize=(7, 6))

# Render the correlation matrix as an image
im = ax.imshow(corr_matrix, cmap='coolwarm', vmin=-1, vmax=1, aspect='auto')

# Add colourbar
plt.colorbar(im, ax=ax, fraction=0.046, pad=0.04)

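# Annotate every cell with the correlation value
# (np.asarray lets corr_matrix be a NumPy array or a pandas DataFrame)
values = np.asarray(corr_matrix)
n = values.shape[0]
for i in range(n):
    for j in range(n):
        val = values[i, j]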
        text_color = 'white' if abs(val) > 0.6 else 'black'
        ax.text(j, i, f'{val:.2f}', ha='center', va='center',
               fontsize=11, color=text_color, fontweight='bold')

ax.set_xticks(range(n)); ax.set_xticklabels(feature_names, rotation=45, ha='right')
ax.set_yticks(range(n)); ax.set_yticklabels(feature_names)
ax.set_title('Correlation Matrix (Pearson)', pad=14)
plt.tight_layout()
plt.show()
📊 Output: ax.imshow() Correlation Heatmap with annotations

Cell text colour automatically switches between white and black based on background intensity — done with the if abs(val) > 0.6 conditional. The colourbar is added with plt.colorbar(im, ax=ax).
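The heatmap listing takes corr_matrix, n, and feature_names as given. One way to obtain them, sketched here on a synthetic DataFrame with hypothetical column names, is to compute the matrix with pandas and convert it to a NumPy array before indexing.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic data for illustration only; column names are hypothetical
rng = np.random.default_rng(7)
age = rng.normal(40, 12, 500)
df = pd.DataFrame({
    'age': age,
    'income': age * 1.5 + rng.normal(0, 10, 500),
    'spend': rng.normal(50, 15, 500),
})

corr = df.corr()              # Pearson correlation by default
values = corr.to_numpy()      # plain array, so values[i, j] indexes by position
names = list(corr.columns)
n = len(names)

fig, ax = plt.subplots(figsize=(5, 4))
im = ax.imshow(values, cmap='coolwarm', vmin=-1, vmax=1)
fig.colorbar(im, ax=ax, fraction=0.046, pad=0.04)

for i in range(n):
    for j in range(n):
        val = values[i, j]
        # Strong correlations sit on dark cells, so switch to white text
        ax.text(j, i, f'{val:.2f}', ha='center', va='center',
                color='white' if abs(val) > 0.6 else 'black')

ax.set_xticks(range(n))
ax.set_xticklabels(names, rotation=45, ha='right')
ax.set_yticks(range(n))
ax.set_yticklabels(names)
fig.tight_layout()
```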


Section 08

Boxplot & Violin Plot — Distribution Shape

Matplotlib's ax.boxplot() summarises a distribution through its median, quartiles, whiskers, and outliers. ax.violinplot() complements this by drawing a mirrored KDE on each side, so the full shape of the distribution is visible.

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

# ── Styled boxplot ─────────────────────────────────
bp = ax1.boxplot(
    data_by_category,
    patch_artist=True,        # fills boxes with colour
    notch=True,              # notch = 95% CI around median
    vert=True,
    widths=0.5
)
# Style each box individually
colors = ['#60a5fa', '#34d399', '#f59e0b', '#a78bfa', '#f87171']
for patch, color in zip(bp['boxes'], colors):
    patch.set_facecolor(color)
    patch.set_alpha(0.6)

# ── Violin plot ─────────────────────────────────────
vp = ax2.violinplot(
    data_by_category,
    showmedians=True, showextrema=True
)
for body, color in zip(vp['bodies'], colors):
    body.set_facecolor(color)
    body.set_alpha(0.5)

plt.tight_layout()
plt.show()
📊 Output: Boxplot (left, patch_artist=True, notch=True) & Violin plot (right, showmedians=True)

Violin plot is wider where most data lives — you can see bimodal distributions that a boxplot completely hides. Use notch=True on boxplots to show the 95% confidence interval around the median.
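The claim about bimodal distributions is easy to check on synthetic data. In the sketch below, one sample is deliberately built from two clusters (an assumption made for the demonstration); the boxplot draws it as a single wide box, while the violin shows two separate bulges.

```python
import matplotlib.pyplot as plt
import numpy as np

# Synthetic samples for illustration only: two clusters around 20 and 60,
# plus an ordinary single-peaked sample for comparison
rng = np.random.default_rng(3)
bimodal = np.concatenate([rng.normal(20, 4, 300), rng.normal(60, 4, 300)])
unimodal = rng.normal(40, 12, 600)
samples = [bimodal, unimodal]

fig, (ax_box, ax_violin) = plt.subplots(1, 2, figsize=(10, 4), sharey=True)

bp = ax_box.boxplot(samples, patch_artist=True, widths=0.5)
vp = ax_violin.violinplot(samples, showmedians=True, showextrema=True)

for ax, title in [(ax_box, 'Boxplot'), (ax_violin, 'Violin plot')]:
    ax.set_xticks([1, 2])                      # both plot types place groups at 1..N
    ax.set_xticklabels(['Bimodal', 'Unimodal'])
    ax.set_title(title)
fig.tight_layout()
```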


Section 09

Styling & Saving Publication-Quality Figures

Matplotlib's style system and fine-grained control over every element make it the gold standard for publication figures. Journals, conference papers, and reports all demand specific font sizes, exact figure dimensions, and lossless export, and matplotlib handles all of these.

# ── List all available styles ───────────────────────
print(plt.style.available)

# ── Apply a style (pick one; later calls layer on top of earlier ones) ──
plt.style.use('seaborn-v0_8-whitegrid')  # clean white
plt.style.use('dark_background')         # pure dark
plt.style.use('ggplot')                   # R-style
plt.style.use('bmh')                      # Bayesian methods

# ── Custom rcParams for publication ─────────────────
plt.rcParams.update({
    'font.family':      'DejaVu Sans',
    'figure.dpi':       150,
    'savefig.dpi':      300,
    'axes.spines.top':  False,   # remove top spine
    'axes.spines.right':False,   # remove right spine
    'axes.grid':        True,
    'grid.alpha':       0.3,
})

# ── Saving figures ───────────────────────────────────
fig.savefig('chart.png',  dpi=300, bbox_inches='tight')
fig.savefig('chart.pdf',  bbox_inches='tight')         # vector PDF
fig.savefig('chart.svg',  bbox_inches='tight')         # scalable SVG
fig.savefig('chart.eps',  bbox_inches='tight')         # for LaTeX
📊 Output: Same data, four matplotlib styles compared (panels: seaborn-v0_8-darkgrid, dark_background, ggplot, bmh (Bayesian Methods))

Same data, four completely different aesthetics. For publication use: bbox_inches='tight' prevents labels from being cropped. Use .pdf or .svg for vector output that scales to any size.
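When a style should apply to a single figure rather than the whole session, plt.style.context() is one option. The sketch below, with made-up data, builds a figure inside the context and then saves it as both raster and vector output; the temporary directory is a hypothetical output location chosen for the example.

```python
import os
import tempfile

import matplotlib.pyplot as plt

# style.context applies a style only inside the with-block,
# leaving the global rcParams untouched afterwards
grid_before = plt.rcParams['axes.grid']
with plt.style.context('ggplot'):
    fig, ax = plt.subplots(figsize=(6, 4))
    ax.plot([1, 2, 3, 4], [10, 12, 9, 14], marker='o')
    ax.set_title('Styled with ggplot')
    fig.tight_layout()
grid_after = plt.rcParams['axes.grid']

# Hypothetical output location: a temporary directory
out_dir = tempfile.mkdtemp()
png_path = os.path.join(out_dir, 'chart.png')
svg_path = os.path.join(out_dir, 'chart.svg')
fig.savefig(png_path, dpi=300, bbox_inches='tight')   # raster, print resolution
fig.savefig(svg_path, bbox_inches='tight')            # vector, scales freely
```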


Section 10

Golden Rules of Matplotlib

🎯 7 Rules for Professional Matplotlib Figures
1. Always use the object-oriented interface: fig, ax = plt.subplots(). Never use stateful plt.plot() calls for anything beyond a quick throwaway chart, because they cause confusing bugs in multi-axes figures.
2. Always call plt.tight_layout() before plt.show() or fig.savefig(). Without it, axis labels and titles routinely get cropped in saved figures.
3. Remove the top and right spines for clean publication figures: ax.spines[['top','right']].set_visible(False). The default four-sided box adds visual weight without adding information.
4. Always label axes with units. An axis labelled "Revenue" is ambiguous. An axis labelled "Revenue (₹ thousands)" is precise. Use ax.yaxis.set_major_formatter() for automatic unit formatting.
5. Save figures with dpi=300 and bbox_inches='tight' for print quality. Save as .svg or .pdf whenever the figure will be resized, because raster formats (PNG) look blurry when scaled up.
6. Use ax.annotate() to add context directly on the chart: point to the anomaly, label the peak, explain the drop. A chart that requires a paragraph of caption text to explain has failed as a visualisation.
7. Set rcParams globally at the top of every notebook rather than styling each chart individually. Consistent font sizes, line widths, and colour schemes across all figures in a report signal professionalism and attention to detail.
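Several of these rules can be folded into a small helper. The function tidy_axes below is a hypothetical convenience wrapper written for this sketch, not part of matplotlib, and the data is made up.

```python
import matplotlib.pyplot as plt
import matplotlib.ticker as mticker


def tidy_axes(ax, xlabel, ylabel):
    """Hypothetical helper: apply rules 3 and 4 to one Axes."""
    ax.spines[['top', 'right']].set_visible(False)   # rule 3: drop extra spines
    ax.set_xlabel(xlabel)                            # rule 4: label with units
    ax.set_ylabel(ylabel)
    return ax


fig, ax = plt.subplots(figsize=(8, 4))               # rule 1: OO interface
ax.plot([1, 2, 3, 4], [120, 135, 128, 150], marker='o')
tidy_axes(ax, 'Quarter', 'Revenue (₹ thousands)')
ax.yaxis.set_major_formatter(
    mticker.FuncFormatter(lambda value, _pos: f'₹{value:.0f}k'))
fig.tight_layout()                                   # rule 2: before show/save
```

Saving with fig.savefig(path, dpi=300, bbox_inches='tight') would then cover rule 5.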
🧮
Key Takeaway

Matplotlib's verbosity is its superpower. Higher-level libraries built on top of it, such as seaborn and pandas plotting, eventually hit a wall where they cannot produce exactly what you need, and the answer is to drop down to matplotlib. Learn it deeply, and you will rarely be blocked by a chart requirement again.