Data Visualization in Python: Charts, Patterns & Anomalies

Section 01

Why Visualise Data?

Numbers alone rarely tell the full story. A table of ten thousand rows can hide a perfect linear trend, a dangerous cluster of outliers, or a seasonal spike — invisible until you plot it. Data visualisation transforms raw numbers into a language the human brain understands instantly: shape, colour, and pattern.

💡

Anscombe's Quartet — Why You Must Always Plot

Four datasets with nearly identical means, variances, and correlations look completely different when plotted. The chart below demonstrates this: same summary statistics, four very different shapes. It is a classic demonstration that summary statistics without visualisation can be dangerously incomplete.

📊 Anscombe's Quartet — Same Statistics, Four Different Shapes

Panels: Dataset I, Dataset II, Dataset III, Dataset IV

All four: mean of x = 9, mean of y ≈ 7.5, variance of x = 11, variance of y ≈ 4.12, correlation ≈ 0.816
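The figures quoted above can be checked directly. The sketch below hard-codes Anscombe's published values and computes the summary statistics with NumPy; the variable names are illustrative only.

```python
import numpy as np

# Anscombe's quartet (Anscombe, 1973). Datasets I-III share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I":   (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II":  (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV":  ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

summary = {}
for name, (x, y) in quartet.items():
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    summary[name] = {
        "mean_x": x.mean(),
        "mean_y": y.mean(),
        "var_x": x.var(ddof=1),            # sample variance
        "var_y": y.var(ddof=1),
        "corr": np.corrcoef(x, y)[0, 1],   # Pearson correlation
    }
    print(name, {k: round(float(v), 3) for k, v in summary[name].items()})
```

The four printed rows agree to about two decimal places, which is exactly why the plots are needed to tell the datasets apart.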

Section 02

Choosing the Right Chart

Every chart type is designed to answer a specific question. Match the chart to the question — not to aesthetics.

Question You Are Asking	Chart Type	Best Library
What is the distribution of a single variable?	Histogram, KDE, Violin	seaborn
How do two numeric variables relate?	Scatter plot, Joint plot	seaborn / plotly
How does a value change over time?	Line chart, Area chart	plotly / matplotlib
How do categories compare?	Bar chart, Grouped bar	seaborn / plotly
What are correlations between many variables?	Heatmap, Pair plot	seaborn
Are there outliers in a numeric column?	Boxplot, Strip plot	seaborn
Where are clusters in my data?	Scatter with colour encoding	plotly / seaborn

⚠️

Avoid Pie Charts for Data Science

Pie charts make it nearly impossible to compare segments accurately — the human eye cannot judge angles as well as lengths. A bar chart is almost always more readable, more precise, and more honest.

Section 03

Distribution Plots — Understanding Shape

Before any modelling, you must understand how each numeric variable is distributed. Distribution plots reveal shape, spread, skewness, and outliers at a glance.

Histogram with KDE Overlay

A histogram divides a variable's range into bins and counts values per bin. The KDE curve is a smooth probability density estimate overlaid on top — together they reveal the full shape of the distribution.

📊 Output: sns.histplot(df['purchase_amount'], bins=40, kde=True)

Legend: frequency (bins), KDE density curve

Right-skewed: the long tail to the right indicates a few very high-value purchases pulling the mean above the median.

🧮 Histogram — Code & Interpretation

Basic

sns.histplot(df['purchase_amount'], bins=40, kde=True)

With hue

sns.histplot(df, x='purchase_amount', hue='gender', kde=True, alpha=0.6)

Read it

Long right tail = right-skewed (mean > median). Two humps = bimodal — two subpopulations present. A narrow tall peak = low variance, very consistent values.
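The reading rules above can be backed with numbers. A minimal sketch, assuming a synthetic right-skewed `purchase_amount` column (log-normal draws stand in for real data; the parameters are arbitrary):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for df['purchase_amount']: log-normal values are right-skewed.
rng = np.random.default_rng(42)
purchase_amount = pd.Series(
    rng.lognormal(mean=4.0, sigma=0.8, size=5000), name="purchase_amount"
)

mean = purchase_amount.mean()
median = purchase_amount.median()
skewness = purchase_amount.skew()  # > 0 means a longer right tail

print(f"mean={mean:.1f}  median={median:.1f}  skew={skewness:.2f}")
print("right-skewed" if mean > median and skewness > 0 else "not right-skewed")
```

A mean well above the median together with positive skewness is the numeric counterpart of the long right tail seen in the histogram.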

Boxplot — Visualising Spread and Outliers

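A boxplot compresses a distribution into a five-number summary: minimum, first quartile (Q1), median, third quartile (Q3), and maximum, with the whiskers usually capped at 1.5×IQR beyond the box. Individual dots plotted beyond the whiskers are candidate outliers, and each one deserves investigation before any decision is made to remove it.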

Box Structure

Q1 │ Median │ Q3

The box spans the IQR (middle 50% of data). The centre line is the median — not the mean.

Whiskers & Outliers

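Lower fence: Q1 − 1.5×IQR / Upper fence: Q3 + 1.5×IQR

Each whisker ends at the most extreme data point that still lies inside its fence. Points beyond the fences are plotted individually as outliers, and each warrants investigation.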
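The fence rule above can be computed directly with pandas. The values below are made up for illustration, with one deliberately extreme purchase.

```python
import pandas as pd

# Hypothetical purchase amounts; 95 is deliberately extreme.
amounts = pd.Series([10, 12, 13, 14, 15, 16, 18, 95], name="purchase_amount")

q1 = amounts.quantile(0.25)
q3 = amounts.quantile(0.75)
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Values outside the fences are the points a boxplot draws individually.
outliers = amounts[(amounts < lower_fence) | (amounts > upper_fence)]

print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
print(f"fences=({lower_fence}, {upper_fence})")
print("outliers:", outliers.tolist())
```

Flagging a value this way only marks it as a candidate; whether it is an error or a real extreme still has to be checked against its source.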

📊 Output: sns.boxplot(data=df, x='product_category', y='purchase_amount')

Legend: IQR box, median, outliers

Electronics and Furniture show the widest IQR, meaning the highest variability in spend. Red dots are outliers: values more than 1.5×IQR below Q1 or above Q3.

🧮 Boxplot — Code

By category

sns.boxplot(data=df, x='product_category', y='purchase_amount', palette='muted')

Add data points

sns.stripplot(data=df, x='product_category', y='purchase_amount', jitter=True, alpha=0.4, size=3)

Read it

Box shifted toward the bottom = right-skewed. Long one-sided whisker = long tail in that direction. Many red dots beyond the fence = heavy-tailed distribution.

Section 04

Relationship Plots — Finding Connections

Relationship plots reveal how two or more variables move together. They are a useful first tool for feature selection: if a feature correlates strongly with your target variable, it is likely useful for prediction, although a weak linear correlation does not rule out a non-linear relationship.

Scatter Plot with Group Encoding

📊 Output: sns.scatterplot(data=df, x='age', y='purchase_amount', hue='income_bracket')

Legend: low income, medium income, high income, very high income

Positive trend: older customers with higher income tend to spend more. Clear colour separation between income groups suggests income is a strong predictor.

Correlation Heatmap

A heatmap of the correlation matrix shows every pairwise Pearson correlation as a coloured cell. It is a quick first check for multicollinearity before building a regression model.

📊 Output: sns.heatmap(df.corr(numeric_only=True), annot=True, fmt='.2f', cmap='coolwarm')

−1.0 (negative) → 0 (none) → +1.0 (positive)

Income and purchase show the strongest positive correlation (0.78). Delivery_days and rating are negatively correlated (−0.44): slower delivery = lower rating.

🧮 Correlation Heatmap — Code

Compute

corr = df.select_dtypes('number').corr().round(2)

Plot (lower triangle only)

mask = np.triu(np.ones_like(corr, dtype=bool))
sns.heatmap(corr, mask=mask, annot=True, fmt='.2f', cmap='coolwarm', center=0, vmin=-1, vmax=1)

Read it

If two features have an absolute correlation above 0.85 with each other, consider dropping one: multicollinearity inflates the variance of the regression coefficient estimates and makes them unstable.
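The threshold check above can be automated instead of read off the heatmap by eye. A minimal sketch on synthetic data, where `spend` is constructed from `income` so the pair is known to be highly correlated (column names and values are assumptions for illustration):

```python
from itertools import combinations

import numpy as np
import pandas as pd

# Synthetic stand-in data: 'spend' is built from 'income', 'delivery_days' is independent.
rng = np.random.default_rng(0)
n = 200
income = rng.normal(50_000, 10_000, n)
df = pd.DataFrame({
    "income": income,
    "spend": 0.1 * income + rng.normal(0, 200, n),
    "delivery_days": rng.normal(5, 2, n),
})

corr = df.select_dtypes("number").corr()
threshold = 0.85  # cut-off taken from the text; tune it for your data

# Visit each unordered pair once and keep those with |r| above the threshold.
high_pairs = [
    (a, b, round(float(corr.loc[a, b]), 2))
    for a, b in combinations(corr.columns, 2)
    if abs(corr.loc[a, b]) > threshold
]
print(high_pairs)
```

Using the absolute value matters: a pair correlated at −0.9 is just as redundant for a linear model as one correlated at +0.9.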

Section 05

Time Series Visualisation

Time series plots answer three questions at once: is there a trend? Is there seasonality? Are there sudden anomalous spikes or drops that need investigation?

📊 Output: px.line(daily, x='order_date', y=['revenue','rolling_7d'])

Legend: daily revenue, 7-day rolling average, anomaly spike

Upward trend over the year with a clear November–December seasonality peak. The red marker in March is an anomaly — investigate before modelling.

📊 Output: sns.barplot(data=df, x='month', y='purchase_amount', estimator='mean')

November and December (highlighted amber) show a clear seasonal peak — any predictive model must account for this year-end pattern.

🧮 Time Series — Code

Prepare

df['order_date'] = pd.to_datetime(df['order_date'])
# Sum purchases per day and name the column 'revenue', the name the plots use
daily = df.groupby('order_date')['purchase_amount'].sum().rename('revenue').reset_index()

Rolling avg

daily['rolling_7d'] = daily['revenue'].rolling(7).mean()

Read it

Consistent upward slope = growth trend. Peaks at fixed intervals = seasonality. A sudden spike that does not repeat = anomaly requiring investigation.
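The preparation steps above can be tried end to end on a tiny made-up order table. In this sketch the daily total is named `revenue`, an assumption chosen to match the column name used in the plot calls.

```python
import pandas as pd

# Hypothetical order-level data: two orders per day for ten days.
dates = pd.date_range("2024-01-01", periods=10, freq="D")
df = pd.DataFrame({
    "order_date": dates.repeat(2).astype(str),  # stored as text, as in many CSV exports
    "purchase_amount": [float(v) for v in range(1, 21)],
})

df["order_date"] = pd.to_datetime(df["order_date"])

# One row per day; the daily total is named 'revenue' here.
daily = (
    df.groupby("order_date")["purchase_amount"]
      .sum()
      .rename("revenue")
      .reset_index()
)

# rolling(7) needs seven observations, so the first six rows are NaN.
daily["rolling_7d"] = daily["revenue"].rolling(7).mean()
print(daily)
```

Note that `rolling(7)` counts rows, not calendar days, so it only represents a true 7-day average when every day is present in `daily`.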

Section 06

Identifying Patterns in Data

A pattern is any regularity that repeats, persists, or can be described by a rule. The three most important pattern types in data science are trends, seasonality, and clustering.

Trend

↗ Direction

Long-term increase or decrease
Seen in line charts over time
Use rolling averages to smooth noise
Test with regression line overlay

Seasonality

⟳ Cycles

Repeating pattern at fixed intervals
Daily, weekly, monthly, or yearly
Visible with month / weekday charts
Isolate with seasonal decomposition

Clustering

⬡ Groups

Natural groupings in scatter plots
Separate clouds of points
Use colour encoding to validate
May suggest customer segments
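Two of the checks listed above, the regression-line test for trend and a month-by-month view for seasonality, can be sketched with NumPy and pandas. The data is synthetic, and averaging detrended residuals by month is only a rough stand-in for a proper seasonal decomposition.

```python
import numpy as np
import pandas as pd

# Synthetic daily revenue: linear growth, a November-December lift, and noise.
rng = np.random.default_rng(1)
dates = pd.date_range("2023-01-01", "2023-12-31", freq="D")
t = np.arange(len(dates))
lift = np.where(dates.month >= 11, 300.0, 0.0)
revenue = pd.Series(1000 + 2.0 * t + lift + rng.normal(0, 50, len(dates)), index=dates)

# Trend: slope of a straight line fitted by least squares (revenue units per day).
slope, intercept = np.polyfit(t, revenue.to_numpy(), deg=1)

# Seasonality: remove the fitted line, then average what is left by calendar month.
residual = revenue - (intercept + slope * t)
monthly_effect = residual.groupby(residual.index.month).mean()
peak_month = int(monthly_effect.idxmax())

print(f"slope per day: {slope:.2f}")
print(f"peak month: {peak_month}")
```

A clearly positive slope supports a growth trend, and a month whose residual average stands well above the rest points to seasonality worth modelling explicitly.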

Section 07

Identifying Anomalies & Outliers Visually

An anomaly is a data point that does not conform to the expected pattern. It can be a data entry error, a legitimate extreme value, or a genuinely interesting signal — like fraud or a sudden equipment failure. Visualisation is the fastest way to find them.

⚠️

Never Delete an Anomaly Without Investigating It

An outlier on a chart is not automatically a mistake. Before removing any extreme value, investigate its origin. Removing legitimate extreme values can seriously bias your model. Always document your decision.

📊 Output: Anomaly Detection — Z-Score Method (|z| > 3 flagged red)

Legend: normal data points, anomalies (|z| > 3)

8 anomalies detected (red). Isolated extremes at the top and bottom — are these VIP customers or data entry errors?
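The z-score rule behind this chart is short enough to write out. A minimal sketch on synthetic data with three injected extremes; the column name, values, and positions are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic purchase amounts with three injected extremes (rows 10, 200, 450).
rng = np.random.default_rng(7)
values = rng.normal(100, 15, 500)
values[[10, 200, 450]] = [400, -150, 350]
df = pd.DataFrame({"purchase_amount": values})

# z-score: distance from the mean in units of standard deviation.
mean = df["purchase_amount"].mean()
std = df["purchase_amount"].std()
df["z"] = (df["purchase_amount"] - mean) / std
df["anomaly"] = df["z"].abs() > 3

flagged = df.index[df["anomaly"]].tolist()
print(f"{len(flagged)} anomalies at rows {flagged}")
```

Because the mean and standard deviation are themselves pulled by extreme values, this rule works best on roughly symmetric data; for heavily skewed columns, the IQR fences from the boxplot section are often a safer first pass.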

📊 Output: Time Series with Control Bands — Rolling Mean ± 2σ

Legend: revenue, rolling mean, ±2σ control band, detected spike

Points outside the green ±2σ band are flagged as spikes. The March spike exceeds the upper bound — likely a promotional event or data error.

🧮 Time Series Anomaly Detection — Code

Step 1

daily['rolling_mean'] = daily['revenue'].rolling(7).mean()
daily['rolling_std'] = daily['revenue'].rolling(7).std()

Step 2

daily['upper'] = daily['rolling_mean'] + 2 * daily['rolling_std']
daily['lower'] = daily['rolling_mean'] - 2 * daily['rolling_std']
daily['spike'] = (daily['revenue'] > daily['upper']) | (daily['revenue'] < daily['lower'])

Step 3

Plot with plotly and overlay spike points as red markers to make anomalies immediately visible to stakeholders.

Section 08

Golden Rules of Data Visualisation

🎯 7 Rules Every Data Scientist Should Follow

Always plot your data before trusting any summary statistic. Anscombe's Quartet shows that near-identical statistics can hide completely different distributions. Your eyes see what numbers hide.

Match the chart to the question, not to personal preference. Histograms for distributions, scatter plots for relationships, line charts for time, bar charts for comparisons.

Always start bar chart y-axes at zero. A truncated axis makes a 2% difference look like a 100% change — the most common form of accidental data deception.

Use colour purposefully. Colour should encode information — a third variable, a category, an anomaly flag — not decorate. Decorative colour adds cognitive load without adding insight.

Never interpret an anomaly without first checking whether it is a data quality issue. Plotting bad data can produce visually striking but completely meaningless patterns.

For EDA use seaborn. For dashboards and communication use plotly. For publication-quality figures use matplotlib. Each tool has its correct context.

A visualisation that requires more than five seconds to interpret has failed. Simplify: remove gridlines, reduce colours, and add a one-sentence annotation stating the insight directly on the chart.
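The rule above about starting bar chart axes at zero is easy to verify with arithmetic. A small sketch with made-up values (98 and 100) and a y-axis truncated at 96:

```python
# Hypothetical bar values and a truncated y-axis baseline.
a, b = 98.0, 100.0
baseline = 96.0

true_diff_pct = (b - a) / a * 100  # actual relative difference

# Drawn bar height is value minus baseline, so truncation changes the visual ratio.
drawn_a = a - baseline
drawn_b = b - baseline
apparent_diff_pct = (drawn_b - drawn_a) / drawn_a * 100

print(f"true difference:     {true_diff_pct:.1f}%")      # 2.0%
print(f"apparent difference: {apparent_diff_pct:.0f}%")  # 100%
```

With the baseline at zero the drawn heights equal the values, so the apparent and true differences coincide.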

🧮

Key Takeaway

Data visualisation is not decoration — it is a form of analysis. The best data scientists treat every chart as a hypothesis test: they ask what pattern they expect to see, plot the data, and compare assumption against reality. When those two things disagree, that gap is where the most valuable insight lives.