Section 01
Why Visualise Data?
Numbers alone rarely tell the full story. A table of ten thousand rows can hide a perfect linear trend, a dangerous cluster of outliers, or a seasonal spike — invisible until you plot it. Data visualisation transforms raw numbers into a language the human brain understands instantly: shape, colour, and pattern.
💡
Anscombe's Quartet — Why You Must Always Plot
Four datasets with nearly identical means, variances, and correlations look completely different when plotted. The chart below demonstrates this: the same summary statistics, four completely different shapes. It is the classic demonstration that statistics without visualisation can be dangerously incomplete.
📊 Anscombe's Quartet — Same Statistics, Four Different Shapes
Panels: Dataset I, Dataset II, Dataset III, Dataset IV
All four: mean x = 9, mean y ≈ 7.5, variance of x = 11, variance of y ≈ 4.12, correlation ≈ 0.816
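The shared statistics can be checked directly. The sketch below assumes the quartet values as they are commonly reproduced from Anscombe's 1973 paper, and uses NumPy with sample variance (`ddof=1`); the variable names are illustrative only.

```python
import numpy as np

# Anscombe's quartet, values as commonly reproduced from Anscombe (1973)
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

summary = {}
for name, (x, y) in quartet.items():
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    summary[name] = {
        "mean_x": x.mean(),
        "mean_y": y.mean(),
        "var_x": x.var(ddof=1),   # sample variance
        "var_y": y.var(ddof=1),
        "corr": np.corrcoef(x, y)[0, 1],  # Pearson correlation
    }
    print(name, {k: round(float(v), 3) for k, v in summary[name].items()})
```

The four printed rows agree to about two decimal places, which is exactly why the plots, not the numbers, are what separate the datasets.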
Section 02
Choosing the Right Chart
Every chart type is designed to answer a specific question. Match the chart to the question — not to aesthetics.
| Question You Are Asking | Chart Type | Best Library |
| --- | --- | --- |
| What is the distribution of a single variable? | Histogram, KDE, Violin | seaborn |
| How do two numeric variables relate? | Scatter plot, Joint plot | seaborn / plotly |
| How does a value change over time? | Line chart, Area chart | plotly / matplotlib |
| How do categories compare? | Bar chart, Grouped bar | seaborn / plotly |
| What are correlations between many variables? | Heatmap, Pair plot | seaborn |
| Are there outliers in a numeric column? | Boxplot, Strip plot | seaborn |
| Where are clusters in my data? | Scatter with colour encoding | plotly / seaborn |
⚠️
Avoid Pie Charts for Data Science
Pie charts make it hard to compare segments accurately, because people judge angles less reliably than lengths measured from a common baseline. A bar chart is almost always more readable, more precise, and more honest.
Section 03
Distribution Plots — Understanding Shape
Before any modelling, you must understand how each numeric variable is distributed. Distribution plots reveal shape, spread, skewness, and outliers at a glance.
Histogram with KDE Overlay
A histogram divides a variable's range into bins and counts values per bin. The KDE curve is a smooth probability density estimate overlaid on top — together they reveal the full shape of the distribution.
📊 Output: sns.histplot(df['purchase_amount'], bins=40, kde=True)
Legend: Frequency (bins), KDE density curve
Right-skewed: the long tail to the right indicates a few very high-value purchases pulling the mean above the median.
🧮 Histogram — Code & Interpretation
Basic
sns.histplot(df['purchase_amount'], bins=40, kde=True)
With hue
sns.histplot(df, x='purchase_amount', hue='gender', kde=True, alpha=0.6)
Read it
Long right tail = right-skewed (mean > median). Two humps = bimodal — two subpopulations present. A narrow tall peak = low variance, very consistent values.
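The "mean > median" reading can be confirmed numerically alongside the plot. This is a minimal sketch on synthetic data: the log-normal sample stands in for a `purchase_amount` column and is an assumption, not the original dataset.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed data standing in for df['purchase_amount'] (assumption)
rng = np.random.default_rng(42)
purchase_amount = pd.Series(
    rng.lognormal(mean=4.0, sigma=0.8, size=5000), name="purchase_amount"
)

mean_value = purchase_amount.mean()
median_value = purchase_amount.median()
skewness = purchase_amount.skew()  # positive for a long right tail

print(f"mean={mean_value:.1f}  median={median_value:.1f}  skew={skewness:.2f}")
```

A positive skew together with a mean above the median matches the long right tail seen in the histogram; a value near zero would suggest a roughly symmetric shape.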
Boxplot — Visualising Spread and Outliers
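A boxplot compresses a distribution into a five-number summary: minimum, first quartile (Q1), median, third quartile (Q3), and maximum. By default, seaborn's whiskers stop at the most extreme points within 1.5×IQR of the box, and individual dots plotted beyond the whiskers are candidate outliers. Each one deserves investigation before any decision is made to remove it.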
📊 Output: sns.boxplot(data=df, x='product_category', y='purchase_amount')
Legend: IQR box, Median, Outliers
Electronics and Furniture show the widest IQR — highest variability in spend. Red dots are outlier values beyond 1.5×IQR.
🧮 Boxplot — Code
By category
sns.boxplot(data=df, x='product_category', y='purchase_amount', palette='muted')
Add data points
sns.stripplot(data=df, x='product_category', y='purchase_amount', jitter=True, alpha=0.4, size=3)
Read it
Median line sitting low in the box with a longer upper whisker = right-skewed. A long whisker on one side = a long tail in that direction. Many red dots beyond the fence = heavy-tailed distribution.
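The 1.5×IQR fence behind those outlier dots is simple arithmetic. The sketch below applies it to a small made-up series (the values are illustrative, not from the original data) using pandas' default quantile interpolation.

```python
import pandas as pd

# Illustrative values only; 95 is planted as an obvious extreme
amounts = pd.Series([10, 12, 13, 14, 15, 16, 17, 18, 20, 95], name="purchase_amount")

q1 = amounts.quantile(0.25)
q3 = amounts.quantile(0.75)
iqr = q3 - q1

# Points beyond these fences are the candidates a boxplot draws as dots
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = amounts[(amounts < lower_fence) | (amounts > upper_fence)]

print(f"Q1={q1}, Q3={q3}, fences=({lower_fence}, {upper_fence})")
print(outliers.tolist())
```

Listing the flagged rows this way gives you something concrete to investigate, rather than only dots on a chart.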
Section 04
Relationship Plots — Finding Connections
Relationship plots reveal how two or more variables move together. They are a first-pass tool for feature selection: a feature that correlates strongly with your target variable is likely to be useful for prediction, although a weak linear correlation does not rule out a non-linear relationship.
Scatter Plot with Group Encoding
📊 Output: sns.scatterplot(data=df, x='age', y='purchase_amount', hue='income_bracket')
Legend: Low income, Medium income, High income, Very High
Positive trend: older customers with higher income tend to spend more. Clear colour separation between income groups suggests income is a strong predictor.
Correlation Heatmap
A heatmap of the correlation matrix shows every pairwise Pearson correlation as a coloured cell. It is a quick first check for multicollinearity before building a regression model, though it only reveals pairwise relationships.
📊 Output: sns.heatmap(df.corr(numeric_only=True), annot=True, fmt='.2f', cmap='coolwarm')
−1.0 (negative) → 0 (none) → +1.0 (positive)
Income and purchase show the strongest positive correlation (0.78). Delivery_days and rating are negatively correlated (−0.44): slower delivery = lower rating.
🧮 Correlation Heatmap — Code
Compute
corr = df.select_dtypes('number').corr().round(2)
Plot (lower triangle only)
mask = np.triu(np.ones_like(corr, dtype=bool))
sns.heatmap(corr, mask=mask, annot=True, fmt='.2f', cmap='coolwarm', center=0, vmin=-1, vmax=1)
Read it
If two features correlate with each other above 0.85 in absolute value, consider dropping one. Multicollinearity inflates the variance of regression coefficients, which makes the estimates unstable and hard to interpret.
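On a wide dataset it can be easier to list the offending pairs than to read them off the heatmap. The helper below is a hypothetical sketch (the name `high_corr_pairs` and the toy columns are assumptions); it uses the 0.85 rule of thumb from the text as its default threshold.

```python
import numpy as np
import pandas as pd


def high_corr_pairs(df: pd.DataFrame, threshold: float = 0.85):
    """Return (col_a, col_b, |r|) for numeric column pairs above the threshold."""
    corr = df.select_dtypes("number").corr().abs()
    # Keep only the strict upper triangle so each pair appears once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    stacked = upper.stack()
    stacked = stacked[stacked > threshold].sort_values(ascending=False)
    return [(a, b, float(r)) for (a, b), r in stacked.items()]


# Toy data: 'b' is a linear function of 'a'; 'c' is a repeating pattern
a = np.arange(50, dtype=float)
toy = pd.DataFrame({"a": a, "b": 2 * a + 1, "c": a % 7})

pairs = high_corr_pairs(toy)
print(pairs)
```

Which member of a flagged pair to drop is a judgment call; domain meaning and data quality usually matter more than the correlation value itself.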
Section 05
Time Series Visualisation
Time series plots answer three questions at once: is there a trend? Is there seasonality? Are there sudden anomalous spikes or drops that need investigation?
📊 Output: px.line(daily, x='order_date', y=['revenue','rolling_7d'])
Legend: Daily revenue, 7-day rolling avg, Anomaly spike
Upward trend over the year with a clear November–December seasonality peak. The red marker in March is an anomaly — investigate before modelling.
📊 Output: sns.barplot(data=df, x='month', y='purchase_amount', estimator='mean')
November and December (highlighted amber) show a clear seasonal peak — any predictive model must account for this year-end pattern.
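The bar heights in a monthly chart like this are just per-month means, which can be computed directly. A minimal sketch with made-up orders (the dates and amounts are illustrative assumptions):

```python
import pandas as pd

# Illustrative orders only
orders = pd.DataFrame({
    "order_date": pd.to_datetime([
        "2024-10-05", "2024-10-20",
        "2024-11-03", "2024-11-28",
        "2024-12-10", "2024-12-24",
    ]),
    "purchase_amount": [40.0, 60.0, 90.0, 110.0, 120.0, 140.0],
})

# Mean purchase per calendar month, the same quantity the bar chart displays
monthly_mean = (
    orders.groupby(orders["order_date"].dt.month)["purchase_amount"]
    .mean()
    .rename_axis("month")
)
peak_month = int(monthly_mean.idxmax())

print(monthly_mean)
print("peak month:", peak_month)
```

With several years of data, grouping by month number pools every January together, which is what you want for seasonality but hides year-over-year growth.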
🧮 Time Series — Code
Prepare
df['order_date'] = pd.to_datetime(df['order_date'])
daily = df.groupby('order_date')['purchase_amount'].sum().rename('revenue').reset_index()
Rolling avg
daily['rolling_7d'] = daily['revenue'].rolling(7).mean()
Read it
Consistent upward slope = growth trend. Peaks at fixed intervals = seasonality. A sudden spike that does not repeat = anomaly requiring investigation.
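Two details of the preparation step are easy to miss: timestamps with a time-of-day component need collapsing to calendar days before grouping, and a 7-day rolling mean is undefined for the first six rows. The sketch below illustrates both on synthetic orders; the `order_day` column and the data are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic orders: two per day for ten days, each worth 100 (assumption)
timestamps = pd.Timestamp("2024-01-01") + pd.to_timedelta(np.arange(20) * 12, unit="h")
orders = pd.DataFrame({"order_date": timestamps, "purchase_amount": 100.0})

# Collapse timestamps to midnight so each calendar day becomes one group
orders["order_day"] = orders["order_date"].dt.normalize()
daily = (
    orders.groupby("order_day")["purchase_amount"]
    .sum()
    .rename("revenue")
    .reset_index()
)

# rolling(7) needs seven observations, so the first six values are NaN
daily["rolling_7d"] = daily["revenue"].rolling(7).mean()
print(daily)
```

Passing `min_periods=1` to `rolling` would fill the early rows with partial-window averages, at the cost of noisier values at the start of the series.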
Section 06
Identifying Patterns in Data
A pattern is any regularity that repeats, persists, or can be described by a rule. The three most important pattern types in data science are trends, seasonality, and clustering.
Trend
↗ Direction
- Long-term increase or decrease
- Seen in line charts over time
- Use rolling averages to smooth noise
- Test with regression line overlay
Clustering
⬡ Groups
- Natural groupings in scatter plots
- Separate clouds of points
- Use colour encoding to validate
- May suggest customer segments
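For the "regression line overlay" trend check listed under Trend, a straight-line fit is usually enough to start with. A minimal sketch using `np.polyfit` on a synthetic series (the data and names are illustrative assumptions):

```python
import numpy as np

# Synthetic daily revenue: growth of 2 per day plus an alternating wiggle
days = np.arange(30)
wiggle = np.where(days % 2 == 0, 3.0, -3.0)
revenue = 100.0 + 2.0 * days + wiggle

# Degree-1 fit returns (slope, intercept); the slope is the estimated trend per day
slope, intercept = np.polyfit(days, revenue, deg=1)
trend_line = intercept + slope * days  # values to overlay on the line chart

print(f"estimated trend: {slope:.2f} per day")
```

A slope close to zero relative to the noise suggests there is no meaningful trend; a formal significance test would need a proper regression rather than this quick fit.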
Section 07
Identifying Anomalies & Outliers Visually
An anomaly is a data point that does not conform to the expected pattern. It can be a data entry error, a legitimate extreme value, or a genuinely interesting signal — like fraud or a sudden equipment failure. Visualisation is the fastest way to find them.
⚠️
Never Delete an Anomaly Without Investigating It
An outlier on a chart is not automatically a mistake. Before removing any extreme value, investigate its origin. Removing legitimate extreme values can seriously bias your model. Always document your decision.
📊 Output: Anomaly Detection — Z-Score Method (|z| > 3 flagged red)
Legend: Normal data points, Anomalies (|z| > 3)
8 anomalies detected (red). Isolated extremes at the top and bottom — are these VIP customers or data entry errors?
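The z-score rule behind this chart takes only a few lines. The sketch below assumes a made-up series with one planted extreme and uses the sample standard deviation, which is the pandas default.

```python
import pandas as pd

# Illustrative values: 50 ordinary points plus one planted extreme
values = pd.Series([98.0, 102.0] * 25 + [500.0], name="purchase_amount")

# z-score: distance from the mean in units of standard deviation
z_scores = (values - values.mean()) / values.std()
anomalies = values[z_scores.abs() > 3]

print(anomalies.tolist())
```

Two caveats: extreme values inflate the very mean and standard deviation used to score them, and with ten or fewer points no observation can reach |z| > 3 at all, so the method suits reasonably large samples.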
📊 Output: Time Series with Control Bands — Rolling Mean ± 2σ
Legend: Revenue, Rolling mean, ±2σ control band, Spike detected
Points outside the green ±2σ band are flagged as spikes. The March spike exceeds the upper bound — likely a promotional event or data error.
🧮 Time Series Anomaly Detection — Code
Step 1
daily['rolling_mean'] = daily['revenue'].rolling(7).mean()
daily['rolling_std'] = daily['revenue'].rolling(7).std()
Step 2
daily['upper'] = daily['rolling_mean'] + 2 * daily['rolling_std']
daily['lower'] = daily['rolling_mean'] - 2 * daily['rolling_std']
daily['spike'] = (daily['revenue'] > daily['upper']) | (daily['revenue'] < daily['lower'])
Step 3
Plot with plotly and overlay spike points as red markers to make anomalies immediately visible to stakeholders.
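Before plotting, it helps to run Steps 1 and 2 end to end and inspect which rows are flagged. The sketch below does this on a synthetic `daily` frame with one planted spike; the data is an assumption for illustration, and the flagged rows are what Step 3 would draw as red markers.

```python
import numpy as np
import pandas as pd

# Synthetic daily revenue with a mild repeating pattern (assumption)
daily = pd.DataFrame({
    "order_date": pd.date_range("2024-03-01", periods=30, freq="D"),
    "revenue": 1000.0 + 10.0 * (np.arange(30) % 5),
})
daily.loc[20, "revenue"] = 3000.0  # planted spike

# Step 1: rolling statistics over a 7-day window
daily["rolling_mean"] = daily["revenue"].rolling(7).mean()
daily["rolling_std"] = daily["revenue"].rolling(7).std()

# Step 2: control band and spike flag (NaN bands compare as False)
daily["upper"] = daily["rolling_mean"] + 2 * daily["rolling_std"]
daily["lower"] = daily["rolling_mean"] - 2 * daily["rolling_std"]
daily["spike"] = (daily["revenue"] > daily["upper"]) | (daily["revenue"] < daily["lower"])

spikes = daily.loc[daily["spike"], ["order_date", "revenue"]]
print(spikes)
```

Because the window includes the current day, a spike widens its own band. One possible variation, not part of the steps above, is to compute the rolling statistics on `daily['revenue'].shift(1)` so each day is judged only against the days before it.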
Section 08
Golden Rules of Data Visualisation
🎯 7 Rules Every Data Scientist Should Follow
1
Always plot your data before trusting any summary statistics. Anscombe's Quartet shows that near-identical statistics can hide completely different distributions. Your eyes see what numbers hide.
2
Match the chart to the question, not to personal preference. Histograms for distributions, scatter plots for relationships, line charts for time, bar charts for comparisons.
3
Always start bar chart y-axes at zero. A truncated axis can make a 2% difference look like a 100% change, which is one of the most common forms of accidental data deception.
4
Use colour purposefully. Colour should encode information — a third variable, a category, an anomaly flag — not decorate. Decorative colour adds cognitive load without adding insight.
5
Never interpret an anomaly without first checking whether it is a data quality issue. Plotting bad data produces visually striking but completely meaningless patterns.
6
For EDA use seaborn. For dashboards and communication use plotly. For publication-quality figures use matplotlib. Each tool has its correct context.
7
A visualisation that requires more than five seconds to interpret has failed. Simplify: remove gridlines, reduce colours, and add a one-sentence annotation stating the insight directly on the chart.
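The arithmetic behind rule 3 is worth seeing once. The sketch below compares how much taller the second bar is drawn for values of 100 and 102 when the axis starts at 0 versus 98; the function name and the example values are illustrative assumptions.

```python
def visible_change(a: float, b: float, baseline: float = 0.0) -> float:
    """Relative change in drawn bar height from a to b when the axis starts at baseline."""
    height_a = a - baseline
    height_b = b - baseline
    return (height_b - height_a) / height_a


honest = visible_change(100, 102, baseline=0)      # axis starts at zero
truncated = visible_change(100, 102, baseline=98)  # axis starts at 98

print(f"axis from 0:  second bar drawn {honest:.0%} taller")
print(f"axis from 98: second bar drawn {truncated:.0%} taller")
```

The underlying data differs by 2% in both cases; only the drawn bar heights change, which is why the truncated version misleads.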
🧮
Key Takeaway
Data visualisation is not decoration — it is a form of analysis. The best data scientists treat every chart as a hypothesis test: they ask what pattern they expect to see, plot the data, and compare assumption against reality. When those two things disagree, that gap is where the most valuable insight lives.