Data Preparation / Data Preprocessing 📂 Data Collection · 2 of 13 31 min read

Exploratory Data Analysis (EDA) with Pandas

A hands-on guide to reading datasets, understanding their structure, data types, and computing essential summary statistics — the foundation of every data science project.

Section 01

What Is Exploratory Data Analysis (EDA)?

Exploratory Data Analysis is the critical first step every data scientist takes before building any model or drawing any conclusion. Before you can trust your data, you must understand it — its shape, its quirks, its missing pieces, and the relationships hiding inside it.

EDA is not a single function or a checklist. It is a mindset: ask questions, look at the data from multiple angles, let the data surprise you. The statistician John Tukey, who coined the term in 1977, described it as "detective work" — you are not yet proving anything, you are forming hypotheses.

💡
Why EDA Comes Before Modelling

Studies show that data scientists spend 60–80% of their time on data understanding and preparation — not on modelling. A model built without EDA risks learning from corrupted, imbalanced, or misunderstood data. EDA is the foundation on which every valid insight is built.

A complete EDA workflow covers three pillars: reading the dataset, understanding its structure and data types, and computing summary statistics. This tutorial walks you through all three with real pandas code.


Section 02

The EDA Pipeline — Step by Step

Think of EDA as a structured investigation. Each step answers a different question about your data. Here is the pipeline we will follow in this tutorial:

01
Load the Dataset
Read data from CSV, Excel, JSON, SQL, or Parquet into a pandas DataFrame using pd.read_*() functions.
02
Inspect Shape & Structure
Check dimensions, column names, and data types using .shape, .columns, .dtypes, and .info().
03
Preview Raw Data
Look at actual rows with .head(), .tail(), and .sample() to spot obvious issues — wrong formats, unexpected values, junk rows.
04
Check for Missing Data
Identify nulls and their patterns using .isnull().sum(). Decide whether to drop, fill, or flag missing values.
05
Summary Statistics
Compute measures of centre and spread — mean, median, variance, standard deviation — using .describe() and individual aggregation methods.
06
Distributions & Relationships
Visualise distributions (histograms, boxplots) and feature correlations (heatmaps, scatter plots) to identify patterns, outliers, and skewness.
📐
This Tutorial Covers Steps 01 – 05

Steps 01–05 form the numerical foundation of EDA. Visualisation (Step 06) is covered in a separate tutorial on matplotlib and seaborn.


Section 03

Reading the Dataset

Setting Up Your Environment

All EDA in Python begins with importing pandas. NumPy is almost always imported alongside it for numerical operations.

import pandas as pd
import numpy as np

# Optional: make pandas display more columns / rows
pd.set_option('display.max_columns', 50)
pd.set_option('display.max_rows', 100)
pd.set_option('display.float_format', '{:.2f}'.format)

Reading Different File Formats

Pandas provides a unified API for loading almost any file format. The result is always the same: a DataFrame — a two-dimensional, labelled table.

File Format Function Key Parameters
CSV pd.read_csv() sep, header, index_col, parse_dates
Excel (.xlsx) pd.read_excel() sheet_name, skiprows, usecols
JSON pd.read_json() orient, lines, dtype
SQL Database pd.read_sql() query, con (connection object)
Parquet pd.read_parquet() columns, engine ('pyarrow' / 'fastparquet')
HTML Table pd.read_html() match, header, index_col

Practical Read Examples

# ── Reading a CSV file ──────────────────────────────────
df = pd.read_csv('sales_data.csv')

# CSV with a custom separator (e.g. tab-delimited)
df = pd.read_csv('data.tsv', sep='\t')

# Parse date columns automatically
df = pd.read_csv('orders.csv', parse_dates=['order_date', 'ship_date'])

# Load only specific columns (faster for large files)
df = pd.read_csv('big_file.csv', usecols=['age', 'income', 'target'])

# Read in chunks (for files too large for RAM)
chunks = []
for chunk in pd.read_csv('huge.csv', chunksize=10000):
    chunks.append(chunk)
df = pd.concat(chunks, ignore_index=True)

# ── Reading from Excel ──────────────────────────────────
df = pd.read_excel('report.xlsx', sheet_name='Q3_Sales')

# ── Reading from a SQL database (SQLite example) ────────
import sqlite3
conn = sqlite3.connect('mydb.db')
df = pd.read_sql('SELECT * FROM customers WHERE country = "India"', conn)
conn.close()

# ── Reading a Parquet file ──────────────────────────────
df = pd.read_parquet('data.parquet')
⚠️
Always Specify Encoding for CSV Files

Files with non-ASCII characters (Hindi, Arabic, accented European letters) may throw a UnicodeDecodeError. Add encoding='utf-8' or try encoding='latin-1' as a fallback: pd.read_csv('file.csv', encoding='utf-8').


Section 04

Understanding Dataset Structure

Once data is loaded, the first task is orientation: how big is this? What are the columns? Are the types what I expect? The following five commands form the standard inspection toolkit.

Shape, Columns & Basic Preview

# How many rows and columns?
print(df.shape)           # (rows, columns)

# What are the column names?
print(df.columns.tolist())

# First 5 rows
df.head()

# Last 5 rows
df.tail()

# 5 random rows — great for spotting patterns without cherry-picking
df.sample(5, random_state=42)
▶ Output — df.shape
(10000, 12)
💡
Why Use .sample() Instead of .head()?

.head() always shows the first rows — which are often sorted, clean, or representative of only one category. .sample() gives you a random cross-section, which is much more likely to reveal issues like mixed data types, stray characters, or outlier values buried in the middle of the dataset.

The Master Inspection Command: .info()

df.info() is the single most informative summary you can get about a DataFrame. It shows column names, data types, non-null counts, and memory usage — all in one call.

df.info()
▶ Output
<class 'pandas.core.frame.DataFrame'> RangeIndex: 10000 entries, 0 to 9999 Data columns (total 12 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 customer_id 10000 non-null int64 1 name 10000 non-null object 2 age 9843 non-null float64 ← 157 nulls! 3 gender 10000 non-null object 4 city 9991 non-null object 5 purchase_amount 10000 non-null float64 6 product_category 10000 non-null object 7 order_date 10000 non-null datetime64[ns] 8 is_returned 10000 non-null bool 9 rating 9720 non-null float64 ← 280 nulls! 10 income_bracket 10000 non-null category 11 delivery_days 10000 non-null int64 dtypes: bool(1), category(1), datetime64[ns](1), float64(3), int64(2), object(3) memory usage: 781.4 KB
🧮 How to Read .info() Output
Check 1
Non-Null Count vs Total Rows — If a column shows fewer non-null entries than the total rows, it has missing values. Here: age has 157 nulls, rating has 280 nulls.
Check 2
Dtype correctness — Is object where you expect datetime64? Is a numeric column stored as object? Fix with pd.to_datetime() or pd.to_numeric().
Check 3
Memory Usage — Columns typed as object use far more RAM than category. Converting low-cardinality columns can cut memory usage by up to 90%.

Section 05

Data Types in Pandas

Every column in a DataFrame has a dtype — a data type that determines how values are stored, how much memory they use, and what operations are valid on them. Understanding and correcting dtypes is one of the most impactful steps in EDA.

int64 / int32
np.int64 · np.int32 · Int64 (nullable)
Whole numbers. No decimals, no nulls (unless using nullable Int64). Used for counts, IDs, discrete quantities.
float64 / float32
np.float64 · np.float32
Decimal numbers. Also used when a column has NaN values, since NaN is a float. Used for prices, measurements, probabilities.
object
Python str · mixed types
Default for text columns. Also appears when a column has mixed types. High memory usage — convert to category when possible.
bool
True / False · 1 / 0
Binary flags. Very memory-efficient. Used for binary outcomes like is_churned, is_returned, has_discount.
datetime64[ns]
pd.Timestamp · pd.DatetimeTZDtype
Date and time values with nanosecond precision. Enables time-based slicing, resampling, and feature extraction (.dt.year, .dt.month).
category
pd.CategoricalDtype
Optimised for columns with few unique values. Stores only the unique labels once, then uses an integer index. Much faster for groupby.

Checking and Fixing Data Types

# View the dtype of every column
print(df.dtypes)

# ── TYPE CONVERSIONS ────────────────────────────────────

# Convert object → datetime
df['order_date'] = pd.to_datetime(df['order_date'])

# Convert object → numeric (coerce turns unparseable values to NaN)
df['price'] = pd.to_numeric(df['price'], errors='coerce')

# Convert object → category (saves memory for low-cardinality columns)
df['gender'] = df['gender'].astype('category')

# Convert float → int (only safe if no NaN values remain)
df['delivery_days'] = df['delivery_days'].astype('int32')

# ── DATETIME FEATURE EXTRACTION ─────────────────────────
df['order_year']  = df['order_date'].dt.year
df['order_month'] = df['order_date'].dt.month
df['day_of_week'] = df['order_date'].dt.day_name()
🎯
Memory Optimisation Rule of Thumb

If a column has fewer than 50 unique values, convert it to category. If an integer column fits within 32-bit range, use int32 instead of int64. These two changes alone can halve the RAM usage of large DataFrames.

Checking Unique Values and Cardinality

# How many unique values does each column have?
df.nunique()

# Frequency count with proportions
df['product_category'].value_counts(normalize=True)

# Check for duplicated rows
print(f"Duplicate rows: {df.duplicated().sum()}")

# Drop duplicates
df = df.drop_duplicates()
▶ Output — df.nunique()
customer_id 10000 name 9997 age 67 gender 3 ← Male, Female, Other city 142 purchase_amount 8842 product_category 8 order_date 3651 is_returned 2 ← True / False rating 9 ← 1.0 to 5.0 income_bracket 4 ← Low, Medium, High, Very High delivery_days 30 dtype: int64

Section 06

Identifying Missing Data

Missing values are one of the most common data quality problems. Pandas represents them as NaN (Not a Number) for numeric columns and NaT for datetime columns. Before computing any statistics, you must understand the extent and pattern of missingness.

# Total null count per column
df.isnull().sum()

# Percentage of nulls per column
null_pct = (df.isnull().sum() / len(df) * 100).round(2)
print(null_pct[null_pct > 0])   # show only columns with nulls
▶ Output — null percentage
age 1.57% rating 2.80% city 0.09% dtype: float64
Missing % Range Recommended Strategy Pandas Method
< 5% Fill with mean / median / mode, or drop rows .fillna(df['col'].median())
5% – 20% Impute using model-based methods (KNN, regression) sklearn.impute.KNNImputer
20% – 50% Add a missing indicator column, then impute df['col_missing'] = df['col'].isnull().astype(int)
> 50% Consider dropping the column entirely df.drop(columns=['col'])

Section 07

Summary Statistics

Summary statistics condense an entire column of data into a handful of numbers that describe its centre, spread, and shape. They are the numerical backbone of EDA — before you plot anything, these numbers tell you what to expect.

The Six Core Statistics

⚖️
Mean
x̄ / μ
Sum of all values divided by count. Sensitive to outliers — one extreme value pulls it significantly.
🎯
Median
Q2 / M
The middle value when data is sorted. Robust to outliers — always use for skewed distributions like income or house prices.
📌
Mode
Mo
The most frequently occurring value. The only central tendency measure that works for categorical data.
📐
Variance
σ² / s²
Average squared deviation from the mean. Hard to interpret directly — use standard deviation instead.
📊
Std Dev
σ / s
Square root of variance. In the same units as the data — interpretable as the typical distance from the mean.
📏
IQR
Q3 − Q1
Interquartile range: the spread of the middle 50% of data. Robust to outliers. The standard metric for boxplot whiskers.

The Formulas

Mean
x̄ = (Σxᵢ) / n
Sum of all observations divided by the number of observations.
Sample Variance
s² = Σ(xᵢ − x̄)² / (n−1)
Divides by n−1 (Bessel's correction) for unbiased estimation from a sample.
Standard Deviation
s = √s²
Square root of variance. Returns spread to the original unit of measurement.
IQR
IQR = Q3 − Q1
Q1 = 25th percentile. Q3 = 75th percentile. Captures the middle 50% of the distribution.
📌
Population vs Sample — The n−1 Rule

Pandas uses sample statistics by default (ddof=1, divides by n−1). This is almost always correct — you are working with a sample, not the entire population. To get population statistics, pass ddof=0: df['col'].var(ddof=0).

Computing Statistics with Pandas

# ── INDIVIDUAL STATISTICS ────────────────────────────────
df['purchase_amount'].mean()
df['purchase_amount'].median()
df['purchase_amount'].mode()[0]
df['purchase_amount'].var()
df['purchase_amount'].std()
df['purchase_amount'].min()
df['purchase_amount'].max()

# ── PERCENTILES & IQR ───────────────────────────────────
Q1  = df['purchase_amount'].quantile(0.25)
Q3  = df['purchase_amount'].quantile(0.75)
IQR = Q3 - Q1

# ── CUSTOM AGGREGATION ──────────────────────────────────
df['purchase_amount'].agg(['mean', 'median', 'std', 'min', 'max'])

# Multiple columns, multiple stats
df[['purchase_amount', 'rating', 'delivery_days']].agg({
    'purchase_amount': ['mean', 'median', 'std'],
    'rating': ['mean', 'min', 'max'],
    'delivery_days': ['mean', 'median']
})

The All-in-One Command: .describe()

# Numeric columns only (default)
df.describe()

# All columns including categoricals
df.describe(include='all')

# Add custom percentiles
df.describe(percentiles=[0.05, 0.25, 0.50, 0.75, 0.95])
▶ Output — df.describe()
purchase_amount age rating delivery_days count 10000.00 9843.00 9720.00 10000.00 mean 6842.37 34.21 3.72 5.43 std 4201.85 9.87 1.02 3.11 min 100.00 18.00 1.00 1.00 25% 3210.50 27.00 3.00 3.00 50% 6500.00 34.00 4.00 5.00 75% 9875.25 41.00 5.00 7.00 max 24999.00 65.00 5.00 15.00
🧮 How to Read .describe() Output
count
Number of non-null values. Compare to total rows to identify missing data. Here: age has only 9843 / 10000 — 157 missing.
mean vs 50%
If mean ≠ median, the distribution is skewed. Here: purchase_amount mean (6842) > median (6500) → slight right skew.
std
Standard deviation = typical spread. Purchase amount std of 4201 on a mean of 6842 (CV = 61%) means high variation — check for outliers.
min / max
The extremes. Are they plausible? A min age of 18 and max of 65 looks reasonable — always validate with domain knowledge.
25% / 75%
IQR = 75th − 25th percentile. For purchase_amount: 9875 − 3210 = 6665. A tight IQR suggests homogeneous data; wide IQR signals spread.

Section 08

Grouped & Conditional Statistics

Overall statistics rarely tell the full story. The most powerful EDA insights come from computing statistics by group. Pandas' .groupby() makes this effortless.

# Mean purchase amount by product category
df.groupby('product_category')['purchase_amount'].mean().sort_values(ascending=False)

# Multiple stats by gender
df.groupby('gender')['purchase_amount'].agg(['mean', 'median', 'std', 'count'])

# Group by multiple columns
df.groupby(['gender', 'product_category'])['purchase_amount'].mean().unstack()

# Correlation matrix — all numeric columns
df.corr(numeric_only=True).round(2)
▶ Output — purchase_amount by product_category (mean, sorted)
product_category Electronics 12840.55 Furniture 9230.10 Clothing 5120.85 Books 1840.30 Grocery 620.75 dtype: float64
💡
When Mean ≠ Median — The Skewness Signal

When the mean is substantially higher than the median, data is right-skewed. Common in income, purchase amounts, and website traffic. In such cases, the median is a more honest description of the typical value. Use df['col'].skew() to quantify: values above +1 or below −1 indicate significant skew.


Section 09

Detecting Outliers

An outlier is a value that lies far from the bulk of the data. Outliers can be legitimate extreme values, data entry errors, or measurement errors. EDA must identify them before modelling.

The IQR Method (Robust, Non-parametric)

def detect_outliers_iqr(series):
    Q1  = series.quantile(0.25)
    Q3  = series.quantile(0.75)
    IQR = Q3 - Q1
    lower   = Q1 - 1.5 * IQR
    upper   = Q3 + 1.5 * IQR
    outliers = series[(series < lower) | (series > upper)]
    print(f"Lower fence : {lower:.2f}")
    print(f"Upper fence : {upper:.2f}")
    print(f"Outliers    : {len(outliers)} ({len(outliers)/len(series)*100:.1f}%)")
    return outliers

outliers = detect_outliers_iqr(df['purchase_amount'])

The Z-Score Method (Assumes Normality)

from scipy import stats

z_scores = stats.zscore(df['purchase_amount'].dropna())
outliers_z = df[abs(z_scores) > 3]
print(f"Z-score outliers: {len(outliers_z)}")
Method Threshold Assumes Normality Best For
IQR Method 1.5 × IQR from Q1/Q3 No Skewed data, general use
Z-Score |z| > 3 Yes Roughly normal distributions
Modified Z-Score |MAD score| > 3.5 No Small datasets, robust
Isolation Forest Anomaly score < threshold No Multivariate outliers

Section 10

Complete EDA Template — Copy & Use

Paste this function into your notebook and replace df with your dataset. It runs a full numerical EDA in one call.

import pandas as pd
import numpy as np

def run_eda(df, dataset_name="Dataset"):
    print(f"\n{'='*55}")
    print(f"  EDA REPORT: {dataset_name}")
    print(f"{'='*55}\n")

    # 1. SHAPE
    print(f"[1] Shape: {df.shape[0]:,} rows × {df.shape[1]} columns\n")

    # 2. DATA TYPES
    print("[2] Data Types:")
    print(df.dtypes.to_string(), "\n")

    # 3. MISSING VALUES
    null_df = pd.DataFrame({
        'Missing': df.isnull().sum(),
        '%': (df.isnull().mean() * 100).round(2)
    })
    null_df = null_df[null_df['Missing'] > 0]
    print("[3] Missing Values:")
    print(null_df.to_string() if len(null_df) > 0 else "None — clean!", "\n")

    # 4. DUPLICATES
    print(f"[4] Duplicated rows: {df.duplicated().sum():,}\n")

    # 5. SUMMARY STATISTICS
    print("[5] Summary Statistics (numeric):")
    print(df.describe(percentiles=[.05,.25,.50,.75,.95]).round(2).to_string(), "\n")

    # 6. CATEGORICAL SUMMARY
    cat_cols = df.select_dtypes(include=['object','category','bool']).columns
    if len(cat_cols) > 0:
        print("[6] Categorical Columns:")
        for col in cat_cols:
            top = df[col].value_counts().head(3)
            print(f"  {col} ({df[col].nunique()} unique): {dict(top)}")
        print()

    # 7. SKEWNESS
    num_cols = df.select_dtypes(include='number').columns
    skew = df[num_cols].skew().sort_values(key=abs, ascending=False)
    print("[7] Skewness (|skew| > 1 = significantly skewed):")
    print(skew[abs(skew) > 0.5].round(2).to_string(), "\n")

    print("EDA complete.")

# Run it
run_eda(df, dataset_name="Sales Dataset 2024")
What to Do After This EDA

Once numerical EDA is complete, proceed to: (1) visual EDA — histograms, boxplots, and correlation heatmaps; (2) feature engineering — create new columns, bin continuous variables, encode categoricals; (3) data cleaning — handle nulls, fix types, remove duplicates.


Section 11

Golden Rules of EDA

🎯 7 Rules Every Data Scientist Should Follow
1
Always run .info() and .describe() immediately after loading any dataset — before writing a single line of transformation code.
2
Never assume a column's dtype is correct. A column named "price" stored as object is one of the most common bugs in data pipelines. Always validate with .dtypes.
3
Report both mean and median for every continuous variable. If they differ substantially, the distribution is skewed and the median is more meaningful.
4
Always check for missing values with .isnull().sum() before computing any aggregate. A single NaN propagates through calculations silently unless handled.
5
Use .sample(n) in addition to .head(). The first rows are rarely representative of the full dataset and can create a false sense of data quality.
6
Validate min and max values against domain knowledge. An age of 0, a negative price, or a 200-day delivery time signals a data quality issue — not an interesting insight.
7
Document every finding in your notebook with a markdown cell. EDA that exists only in your memory provides no value to teammates, reviewers, or your future self.