
Data Visualization & Pattern Detection

A comprehensive guide to visualising data using Python's most powerful libraries — matplotlib, seaborn, and plotly — covering chart selection, identifying patterns, trends, and anomalies with real code examples.

Section 01

Why Visualise Data?

Numbers alone rarely tell the full story. A table of ten thousand rows can hide a perfect linear trend, a dangerous cluster of outliers, or a seasonal spike — invisible until you plot it. Data visualisation transforms raw numbers into a language the human brain understands instantly: shape, colour, and pattern.

💡
Anscombe's Quartet — Why You Must Always Plot

Four datasets with nearly identical means, variances, and correlations look completely different when plotted. The chart below demonstrates this: same statistics, four completely different shapes. Summary statistics alone cannot tell these datasets apart, which is why statistics without visualisation are dangerously incomplete.

📊 Anscombe's Quartet — Same Statistics, Four Different Shapes
Panels: Dataset I, Dataset II, Dataset III, Dataset IV

All four: mean x = 9, mean y ≈ 7.5, variance of x = 11, variance of y ≈ 4.12, correlation ≈ 0.816
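The claim above can be checked numerically. The sketch below hard-codes the published quartet values (Anscombe, 1973) and computes the summary statistics for each dataset with NumPy; the helper name `summarise` is an illustrative choice, not part of any library.

```python
import numpy as np

# Anscombe's quartet. Datasets I-III share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I":   (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II":  (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV":  ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

def summarise(x, y):
    """Return mean, sample variance and Pearson correlation for one dataset."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return {
        "mean_x": x.mean(),
        "var_x": x.var(ddof=1),
        "mean_y": y.mean(),
        "var_y": y.var(ddof=1),
        "corr": np.corrcoef(x, y)[0, 1],
    }

summaries = {name: summarise(x, y) for name, (x, y) in quartet.items()}
for name, stats in summaries.items():
    print(name, {key: round(value, 3) for key, value in stats.items()})
```

The four printed rows agree to about two decimal places, even though a scatter plot of each dataset looks nothing like the others.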


Section 02

Choosing the Right Chart

Every chart type is designed to answer a specific question. Match the chart to the question — not to aesthetics.

Question You Are Asking | Chart Type | Best Library
What is the distribution of a single variable? | Histogram, KDE, Violin | seaborn
How do two numeric variables relate? | Scatter plot, Joint plot | seaborn / plotly
How does a value change over time? | Line chart, Area chart | plotly / matplotlib
How do categories compare? | Bar chart, Grouped bar | seaborn / plotly
What are correlations between many variables? | Heatmap, Pair plot | seaborn
Are there outliers in a numeric column? | Boxplot, Strip plot | seaborn
Where are clusters in my data? | Scatter with colour encoding | plotly / seaborn
⚠️
Avoid Pie Charts for Data Science

Pie charts make it hard to compare segments accurately, because the human eye judges angles less precisely than lengths. A bar chart is almost always more readable, more precise, and more honest.


Section 03

Distribution Plots — Understanding Shape

Before any modelling, you must understand how each numeric variable is distributed. Distribution plots reveal shape, spread, skewness, and outliers at a glance.

Histogram with KDE Overlay

A histogram divides a variable's range into bins and counts values per bin. The KDE curve is a smooth probability density estimate overlaid on top — together they reveal the full shape of the distribution.

📊 Output: sns.histplot(df['purchase_amount'], bins=40, kde=True)
Legend: frequency (bins), KDE density curve

Right-skewed: the long tail to the right indicates a few very high-value purchases pulling the mean above the median.

🧮 Histogram — Code & Interpretation
Basic
sns.histplot(df['purchase_amount'], bins=40, kde=True)
With hue
sns.histplot(df, x='purchase_amount', hue='gender', kde=True, alpha=0.6)
Read it
Long right tail = right-skewed (mean > median). Two humps = bimodal — two subpopulations present. A narrow tall peak = low variance, very consistent values.
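The "mean > median" reading of a right-skewed histogram can be confirmed with numbers. This is a minimal sketch on synthetic data: the log-normal series is a hypothetical stand-in for `df['purchase_amount']`, and the seed and parameters are arbitrary.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical stand-in for df['purchase_amount']: log-normal values have a long right tail.
purchase_amount = pd.Series(
    rng.lognormal(mean=4.0, sigma=0.6, size=5_000), name="purchase_amount"
)

# The counts a 40-bin histogram would draw, plus the 41 bin edges.
counts, edges = np.histogram(purchase_amount, bins=40)

mean = purchase_amount.mean()
median = purchase_amount.median()
skewness = purchase_amount.skew()  # positive for a right-skewed distribution

print(f"mean={mean:.1f}  median={median:.1f}  skewness={skewness:.2f}")
print("fullest bin starts at", round(edges[counts.argmax()], 1))
```

A positive skewness together with a mean above the median is the numeric counterpart of the long right tail seen in the plot.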

Boxplot — Visualising Spread and Outliers

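A boxplot compresses a distribution into a five-number summary: the first quartile (Q1), the median, the third quartile (Q3), and the two whisker ends. Individual dots plotted beyond the whiskers are candidate outliers, and each one deserves investigation before any decision is made to remove it.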

Box Structure
Q1 │ Median │ Q3
The box spans the IQR (middle 50% of data). The centre line is the median — not the mean.
Whiskers & Outliers
Fences: Q1 − 1.5×IQR and Q3 + 1.5×IQR
Each whisker ends at the most extreme data point that still lies inside its fence. Points beyond the fences are plotted individually as outliers, and each warrants investigation.
📊 Output: sns.boxplot(data=df, x='product_category', y='purchase_amount')
Legend: IQR box, median, outliers

Electronics and Furniture show the widest IQR — highest variability in spend. Red dots are outlier values beyond 1.5×IQR.

🧮 Boxplot — Code
By category
sns.boxplot(data=df, x='product_category', y='purchase_amount', palette='muted')
Add data points
sns.stripplot(data=df, x='product_category', y='purchase_amount', color='black', jitter=True, alpha=0.4, size=3)
Read it
Box sitting near the bottom of the range with a longer upper whisker = right-skewed. Long one-sided whisker = a long tail in that direction. Many red dots beyond the fence = heavy-tailed distribution.
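The 1.5×IQR fences that decide which points a boxplot draws as individual dots can be computed directly. The sketch below uses a small made-up series; `iqr_fences` is a hypothetical helper name, and `k=1.5` is the conventional multiplier described above.

```python
import pandas as pd

def iqr_fences(values: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up purchase amounts with one extreme value.
amounts = pd.Series([12, 15, 18, 20, 22, 25, 27, 30, 33, 120])

lower, upper = iqr_fences(amounts)
outliers = amounts[(amounts < lower) | (amounts > upper)]

print(f"fences: [{lower}, {upper}]")   # fences: [2.375, 45.375]
print("candidate outliers:", outliers.tolist())  # candidate outliers: [120]
```

Flagging a value this way only marks it as a candidate. Whether 120 is an error or a genuine large purchase still has to be checked against the source data.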

Section 04

Relationship Plots — Finding Connections

Relationship plots reveal how two or more variables move together. They are a useful first-pass tool for feature selection: if a feature correlates strongly with your target variable, it is likely useful for prediction.

Scatter Plot with Group Encoding

📊 Output: sns.scatterplot(data=df, x='age', y='purchase_amount', hue='income_bracket')
Legend: low income, medium income, high income, very high income

Positive trend: older customers with higher income tend to spend more. Clear colour separation between income groups suggests income is a strong predictor.

Correlation Heatmap

A heatmap of the correlation matrix shows every pairwise Pearson correlation as a coloured cell. It is a quick first check for multicollinearity before building a regression model, although it only reveals relationships between pairs of variables.

📊 Output: sns.heatmap(df.corr(numeric_only=True), annot=True, fmt='.2f', cmap='coolwarm')
Colour scale: −1.0 (negative) → 0 (none) → +1.0 (positive)

Income and purchase show the strongest positive correlation (0.78). Delivery_days and rating are negatively correlated (−0.44): slower delivery = lower rating.

🧮 Correlation Heatmap — Code
Compute
corr = df.select_dtypes('number').corr().round(2)
Plot (lower triangle only)
mask = np.triu(np.ones_like(corr, dtype=bool))
sns.heatmap(corr, mask=mask, annot=True, fmt='.2f', cmap='coolwarm', center=0, vmin=-1, vmax=1)
Read it
If two features have an absolute correlation above roughly 0.85 with each other, consider dropping one. Multicollinearity inflates the variance of regression coefficients and makes them unstable.
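Reading a large heatmap by eye gets tedious, so the same rule of thumb can be applied programmatically. This is a sketch on synthetic data: the column names, the seed, and the 0.85 cut-off are illustrative assumptions, with `income_monthly` deliberately built as a near-duplicate of `income`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
income = rng.normal(60_000, 15_000, n)

# Hypothetical feature table; income_monthly is almost a rescaled copy of income.
df = pd.DataFrame({
    "income": income,
    "income_monthly": income / 12 + rng.normal(0, 100, n),
    "age": rng.integers(18, 70, n),
    "purchase_amount": 0.001 * income + rng.normal(0, 20, n),
})

corr = df.corr()

# Keep each pair once: blank out the diagonal and the lower triangle.
upper_triangle = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper_triangle).stack().dropna()

threshold = 0.85  # rule-of-thumb cut-off, not a hard statistical limit
high = pairs[pairs.abs() > threshold]
print(high)
```

Each row of `high` names a pair of columns that carry nearly the same information, which makes one of them a candidate for removal.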

Section 05

Time Series Visualisation

Time series plots answer three questions at once: is there a trend? Is there seasonality? Are there sudden anomalous spikes or drops that need investigation?

📊 Output: px.line(daily, x='order_date', y=['revenue','rolling_7d'])
Legend: daily revenue, 7-day rolling average, anomaly spike

Upward trend over the year with a clear November–December seasonality peak. The red marker in March is an anomaly — investigate before modelling.

📊 Output: sns.barplot(data=df, x='month', y='purchase_amount', estimator='mean')

November and December (highlighted amber) show a clear seasonal peak — any predictive model must account for this year-end pattern.

🧮 Time Series — Code
Prepare
df['order_date'] = pd.to_datetime(df['order_date'])
daily = df.groupby('order_date')['purchase_amount'].sum().rename('revenue').reset_index()  # 'revenue' matches the plot call
Rolling avg
daily['rolling_7d'] = daily['revenue'].rolling(7).mean()
Read it
Consistent upward slope = growth trend. Peaks at fixed intervals = seasonality. A sudden spike that does not repeat = anomaly requiring investigation.
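The preparation steps above can be run end to end on synthetic data. This sketch makes two assumptions beyond the original code: the order log is randomly generated, and the orders carry full timestamps, so `resample('D')` is used to bucket them into calendar days (days with no orders become 0). The daily total is named `revenue` to match the column the line chart expects.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Hypothetical order log: 2,000 orders with hourly timestamps over 90 days.
hours = rng.integers(0, 90 * 24, 2_000)
orders = pd.DataFrame({
    "order_date": pd.Timestamp("2024-01-01") + pd.to_timedelta(hours, unit="h"),
    "purchase_amount": rng.gamma(2.0, 30.0, 2_000).round(2),
}).sort_values("order_date")

# One row per calendar day, with the day's total spend.
daily = (
    orders.set_index("order_date")["purchase_amount"]
    .resample("D")
    .sum()
    .rename("revenue")
    .reset_index()
)

# 7-day rolling average; the first six rows are NaN because the window is incomplete.
daily["rolling_7d"] = daily["revenue"].rolling(7).mean()

# Average order value per calendar month, the input for a monthly bar chart.
monthly_mean = orders.groupby(orders["order_date"].dt.month)["purchase_amount"].mean()

print(daily.head(8))
print(monthly_mean.round(2))
```

The leading NaN values in `rolling_7d` are expected. Plotting libraries simply start the smoothed line on day seven.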

Section 06

Identifying Patterns in Data

A pattern is any regularity that repeats, persists, or can be described by a rule. The three most important pattern types in data science are trends, seasonality, and clustering.

Trend
↗ Direction
  • Long-term increase or decrease
  • Seen in line charts over time
  • Use rolling averages to smooth noise
  • Test with regression line overlay
Seasonality
⟳ Cycles
  • Repeating pattern at fixed intervals
  • Daily, weekly, monthly, or yearly
  • Visible with month / weekday charts
  • Isolate with seasonal decomposition
Clustering
⬡ Groups
  • Natural groupings in scatter plots
  • Separate clouds of points
  • Use colour encoding to validate
  • May suggest customer segments
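Two of the checks listed above, the regression line for trend and the weekday view for seasonality, can be sketched with NumPy and pandas alone. The revenue series here is synthetic (linear growth plus a weekend uplift plus noise), so the numbers are illustrative rather than taken from any real dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
dates = pd.date_range("2024-01-01", periods=364, freq="D")  # exactly 52 weeks
t = np.arange(len(dates))

# Hypothetical revenue: growth of 2 per day, +300 on weekends, plus random noise.
weekend = (dates.dayofweek >= 5).astype(float)
revenue = pd.Series(
    1_000 + 2.0 * t + 300 * weekend + rng.normal(0, 50, len(t)), index=dates
)

# Trend: slope of a least-squares line through the series.
slope, intercept = np.polyfit(t, revenue.to_numpy(), deg=1)

# Seasonality: average deviation from the fitted trend for each weekday (0 = Monday).
detrended = revenue - (intercept + slope * t)
weekday_profile = detrended.groupby(detrended.index.dayofweek).mean()

print(f"estimated daily growth: {slope:.2f}")
print(weekday_profile.round(1))
```

A clearly positive slope indicates a trend, and weekday averages that sit well above or below zero indicate a weekly cycle. For real data, a dedicated seasonal decomposition is usually a better choice than this simple detrending.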

Section 07

Identifying Anomalies & Outliers Visually

An anomaly is a data point that does not conform to the expected pattern. It can be a data entry error, a legitimate extreme value, or a genuinely interesting signal — like fraud or a sudden equipment failure. Visualisation is the fastest way to find them.

⚠️
Never Delete an Anomaly Without Investigating It

An outlier on a chart is not automatically a mistake. Before removing any extreme value, investigate its origin. Removing legitimate extreme values can seriously bias your model. Always document your decision.

📊 Output: Anomaly Detection — Z-Score Method (|z| > 3 flagged red)
Legend: normal data points, anomalies (|z| > 3)

8 anomalies detected (red). Isolated extremes at the top and bottom — are these VIP customers or data entry errors?
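The z-score rule behind the chart above is short enough to show in full. This is a sketch on synthetic data with two injected extremes; the series, the seed, and the positions of the extremes are all made up for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)

# Hypothetical purchase amounts, roughly normal around 100.
amounts = pd.Series(rng.normal(100, 15, 1_000))
amounts.iloc[[10, 500]] = [400.0, 350.0]  # two injected extreme values

# z-score: distance from the mean measured in standard deviations.
z = (amounts - amounts.mean()) / amounts.std()

# Flag every point more than three standard deviations from the mean.
anomalies = amounts[z.abs() > 3]
print(anomalies)
```

One caveat: the mean and standard deviation are themselves pulled by the extremes they are meant to detect, so very large outliers can hide smaller ones. The flagged rows are a shortlist to investigate, not a list to delete.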

📊 Output: Time Series with Control Bands — Rolling Mean ± 2σ
Legend: revenue, rolling mean, ±2σ control band, spike detected

Points outside the green ±2σ band are flagged as spikes. The March spike exceeds the upper bound — likely a promotional event or data error.

🧮 Time Series Anomaly Detection — Code
Step 1
daily['rolling_mean'] = daily['revenue'].rolling(7).mean()
daily['rolling_std'] = daily['revenue'].rolling(7).std()
Step 2
daily['upper'] = daily['rolling_mean'] + 2 * daily['rolling_std']
daily['lower'] = daily['rolling_mean'] - 2 * daily['rolling_std']
daily['spike'] = (daily['revenue'] > daily['upper']) | (daily['revenue'] < daily['lower'])
Step 3
Plot with plotly and overlay spike points as red markers to make anomalies immediately visible to stakeholders.
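Steps 1 to 3 can be combined into one runnable sketch. Two things here are assumptions rather than the method shown above: the revenue series is synthetic with one injected spike, and the band is computed from the previous seven days (`shift(1)`) so that a spike cannot inflate the mean and standard deviation it is compared against. The `plot_spikes` helper is a hypothetical name and imports plotly only when called.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)
dates = pd.date_range("2024-01-01", periods=120, freq="D")

# Hypothetical daily revenue with one injected spike on 2024-03-16 (row 75).
daily = pd.DataFrame({"order_date": dates, "revenue": rng.normal(1_000, 40, len(dates))})
daily.loc[75, "revenue"] = 2_000.0

# Variation on Step 1: statistics of the seven days before each point.
prior_week = daily["revenue"].shift(1).rolling(7)
daily["rolling_mean"] = prior_week.mean()
daily["rolling_std"] = prior_week.std()

# Step 2: the ±2σ control band and the spike flag.
daily["upper"] = daily["rolling_mean"] + 2 * daily["rolling_std"]
daily["lower"] = daily["rolling_mean"] - 2 * daily["rolling_std"]
daily["spike"] = (daily["revenue"] > daily["upper"]) | (daily["revenue"] < daily["lower"])

def plot_spikes(frame: pd.DataFrame):
    """Step 3: revenue line with spike days overlaid as red markers (requires plotly)."""
    import plotly.graph_objects as go

    spikes = frame[frame["spike"]]
    fig = go.Figure()
    fig.add_trace(go.Scatter(x=frame["order_date"], y=frame["revenue"],
                             mode="lines", name="Revenue"))
    fig.add_trace(go.Scatter(x=spikes["order_date"], y=spikes["revenue"],
                             mode="markers", name="Spike",
                             marker=dict(color="red", size=9)))
    return fig

print(daily.loc[daily["spike"], ["order_date", "revenue", "upper", "lower"]])
```

Calling `plot_spikes(daily).show()` renders the chart. A ±2σ band flags a noticeable share of ordinary days as well, so the flagged list usually needs a second look before anything is reported as an anomaly.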

Section 08

Golden Rules of Data Visualisation

🎯 7 Rules Every Data Scientist Should Follow
1. Plot your data before relying on summary statistics. Anscombe's Quartet shows that identical statistics can hide completely different distributions. Your eyes see what numbers hide.
2. Match the chart to the question, not to personal preference. Histograms for distributions, scatter plots for relationships, line charts for time, bar charts for comparisons.
3. Always start bar chart y-axes at zero. A truncated axis can make a 2% difference look like a 100% change, one of the most common forms of accidental data deception.
4. Use colour purposefully. Colour should encode information (a third variable, a category, an anomaly flag), not decorate. Decorative colour adds cognitive load without adding insight.
5. Never interpret an anomaly as a real signal without first checking whether it is a data quality issue. Plotting bad data produces visually striking but completely meaningless patterns.
6. For EDA use seaborn. For dashboards and communication use plotly. For publication-quality figures use matplotlib. Each tool has its correct context.
7. A visualisation that requires more than five seconds to interpret has failed. Simplify: remove gridlines, reduce colours, and add a one-sentence annotation stating the insight directly on the chart.
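The arithmetic behind the zero-baseline rule is easy to verify. In this sketch, `apparent_ratio` is a hypothetical helper that compares the drawn heights of two bars for a given axis start; the values 100, 102, and 98 are chosen purely for illustration.

```python
def apparent_ratio(a: float, b: float, axis_start: float = 0.0) -> float:
    """Ratio of drawn bar heights for values a and b when the y-axis starts at axis_start."""
    return (b - axis_start) / (a - axis_start)

# Two values that differ by 2%.
honest = apparent_ratio(100, 102)                    # axis starts at zero
truncated = apparent_ratio(100, 102, axis_start=98)  # axis starts at 98

print(f"axis from 0:  second bar is {honest:.2f}x the first")     # 1.02x
print(f"axis from 98: second bar is {truncated:.2f}x the first")  # 2.00x
```

With the axis starting at 98, the second bar is drawn twice as tall as the first, so a 2% difference reads as a 100% increase.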
🧮
Key Takeaway

Data visualisation is not decoration — it is a form of analysis. The best data scientists treat every chart as a hypothesis test: they ask what pattern they expect to see, plot the data, and compare assumption against reality. When those two things disagree, that gap is where the most valuable insight lives.