What Is Exploratory Data Analysis (EDA)?
Exploratory Data Analysis is the critical first step every data scientist takes before building any model or drawing any conclusion. Before you can trust your data, you must understand it — its shape, its quirks, its missing pieces, and the relationships hiding inside it.
EDA is not a single function or a checklist. It is a mindset: ask questions, look at the data from multiple angles, let the data surprise you. The statistician John Tukey, who coined the term in 1977, described it as "detective work" — you are not yet proving anything, you are forming hypotheses.
Studies show that data scientists spend 60–80% of their time on data understanding and preparation — not on modelling. A model built without EDA risks learning from corrupted, imbalanced, or misunderstood data. EDA is the foundation on which every valid insight is built.
A complete EDA workflow covers three pillars: reading the dataset, understanding its structure and data types, and computing summary statistics. This tutorial walks you through all three with real pandas code.
The EDA Pipeline — Step by Step
Think of EDA as a structured investigation. Each step answers a different question about your data. Here is the pipeline we will follow in this tutorial:
pd.read_*() functions..shape, .columns, .dtypes, and .info()..head(), .tail(), and .sample() to spot obvious issues — wrong formats, unexpected values, junk rows..isnull().sum(). Decide whether to drop, fill, or flag missing values..describe() and individual aggregation methods.Steps 01–05 form the numerical foundation of EDA. Visualisation (Step 06) is covered in a separate tutorial on matplotlib and seaborn.
Reading the Dataset
Setting Up Your Environment
All EDA in Python begins with importing pandas. NumPy is almost always imported alongside it for numerical operations.
import pandas as pd
import numpy as np
# Optional: make pandas display more columns / rows
pd.set_option('display.max_columns', 50)
pd.set_option('display.max_rows', 100)
pd.set_option('display.float_format', '{:.2f}'.format)
Reading Different File Formats
Pandas provides a unified API for loading almost any file format. The result is always the same: a DataFrame — a two-dimensional, labelled table.
| File Format | Function | Key Parameters |
|---|---|---|
| CSV | pd.read_csv() |
sep, header, index_col, parse_dates |
| Excel (.xlsx) | pd.read_excel() |
sheet_name, skiprows, usecols |
| JSON | pd.read_json() |
orient, lines, dtype |
| SQL Database | pd.read_sql() |
query, con (connection object) |
| Parquet | pd.read_parquet() |
columns, engine ('pyarrow' / 'fastparquet') |
| HTML Table | pd.read_html() |
match, header, index_col |
Practical Read Examples
# ── Reading a CSV file ──────────────────────────────────
df = pd.read_csv('sales_data.csv')
# CSV with a custom separator (e.g. tab-delimited)
df = pd.read_csv('data.tsv', sep='\t')
# Parse date columns automatically
df = pd.read_csv('orders.csv', parse_dates=['order_date', 'ship_date'])
# Load only specific columns (faster for large files)
df = pd.read_csv('big_file.csv', usecols=['age', 'income', 'target'])
# Read in chunks (for files too large for RAM)
chunks = []
for chunk in pd.read_csv('huge.csv', chunksize=10000):
chunks.append(chunk)
df = pd.concat(chunks, ignore_index=True)
# ── Reading from Excel ──────────────────────────────────
df = pd.read_excel('report.xlsx', sheet_name='Q3_Sales')
# ── Reading from a SQL database (SQLite example) ────────
import sqlite3
conn = sqlite3.connect('mydb.db')
df = pd.read_sql('SELECT * FROM customers WHERE country = "India"', conn)
conn.close()
# ── Reading a Parquet file ──────────────────────────────
df = pd.read_parquet('data.parquet')
Files with non-ASCII characters (Hindi, Arabic, accented European letters) may throw a UnicodeDecodeError. Add encoding='utf-8' or try encoding='latin-1' as a fallback: pd.read_csv('file.csv', encoding='utf-8').
Understanding Dataset Structure
Once data is loaded, the first task is orientation: how big is this? What are the columns? Are the types what I expect? The following five commands form the standard inspection toolkit.
Shape, Columns & Basic Preview
# How many rows and columns?
print(df.shape) # (rows, columns)
# What are the column names?
print(df.columns.tolist())
# First 5 rows
df.head()
# Last 5 rows
df.tail()
# 5 random rows — great for spotting patterns without cherry-picking
df.sample(5, random_state=42)
.head() always shows the first rows — which are often sorted, clean, or representative of only one category. .sample() gives you a random cross-section, which is much more likely to reveal issues like mixed data types, stray characters, or outlier values buried in the middle of the dataset.
The Master Inspection Command: .info()
df.info() is the single most informative summary you can get about a DataFrame. It shows column names, data types, non-null counts, and memory usage — all in one call.
df.info()
object where you expect datetime64? Is a numeric column stored as object? Fix with pd.to_datetime() or pd.to_numeric().object use far more RAM than category. Converting low-cardinality columns can cut memory usage by up to 90%.Data Types in Pandas
Every column in a DataFrame has a dtype — a data type that determines how values are stored, how much memory they use, and what operations are valid on them. Understanding and correcting dtypes is one of the most impactful steps in EDA.
Checking and Fixing Data Types
# View the dtype of every column
print(df.dtypes)
# ── TYPE CONVERSIONS ────────────────────────────────────
# Convert object → datetime
df['order_date'] = pd.to_datetime(df['order_date'])
# Convert object → numeric (coerce turns unparseable values to NaN)
df['price'] = pd.to_numeric(df['price'], errors='coerce')
# Convert object → category (saves memory for low-cardinality columns)
df['gender'] = df['gender'].astype('category')
# Convert float → int (only safe if no NaN values remain)
df['delivery_days'] = df['delivery_days'].astype('int32')
# ── DATETIME FEATURE EXTRACTION ─────────────────────────
df['order_year'] = df['order_date'].dt.year
df['order_month'] = df['order_date'].dt.month
df['day_of_week'] = df['order_date'].dt.day_name()
If a column has fewer than 50 unique values, convert it to category. If an integer column fits within 32-bit range, use int32 instead of int64. These two changes alone can halve the RAM usage of large DataFrames.
Checking Unique Values and Cardinality
# How many unique values does each column have?
df.nunique()
# Frequency count with proportions
df['product_category'].value_counts(normalize=True)
# Check for duplicated rows
print(f"Duplicate rows: {df.duplicated().sum()}")
# Drop duplicates
df = df.drop_duplicates()
Identifying Missing Data
Missing values are one of the most common data quality problems. Pandas represents them as NaN (Not a Number) for numeric columns and NaT for datetime columns. Before computing any statistics, you must understand the extent and pattern of missingness.
# Total null count per column
df.isnull().sum()
# Percentage of nulls per column
null_pct = (df.isnull().sum() / len(df) * 100).round(2)
print(null_pct[null_pct > 0]) # show only columns with nulls
| Missing % Range | Recommended Strategy | Pandas Method |
|---|---|---|
| < 5% | Fill with mean / median / mode, or drop rows | .fillna(df['col'].median()) |
| 5% – 20% | Impute using model-based methods (KNN, regression) | sklearn.impute.KNNImputer |
| 20% – 50% | Add a missing indicator column, then impute | df['col_missing'] = df['col'].isnull().astype(int) |
| > 50% | Consider dropping the column entirely | df.drop(columns=['col']) |
Summary Statistics
Summary statistics condense an entire column of data into a handful of numbers that describe its centre, spread, and shape. They are the numerical backbone of EDA — before you plot anything, these numbers tell you what to expect.
The Six Core Statistics
The Formulas
Pandas uses sample statistics by default (ddof=1, divides by n−1). This is almost always correct — you are working with a sample, not the entire population. To get population statistics, pass ddof=0: df['col'].var(ddof=0).
Computing Statistics with Pandas
# ── INDIVIDUAL STATISTICS ────────────────────────────────
df['purchase_amount'].mean()
df['purchase_amount'].median()
df['purchase_amount'].mode()[0]
df['purchase_amount'].var()
df['purchase_amount'].std()
df['purchase_amount'].min()
df['purchase_amount'].max()
# ── PERCENTILES & IQR ───────────────────────────────────
Q1 = df['purchase_amount'].quantile(0.25)
Q3 = df['purchase_amount'].quantile(0.75)
IQR = Q3 - Q1
# ── CUSTOM AGGREGATION ──────────────────────────────────
df['purchase_amount'].agg(['mean', 'median', 'std', 'min', 'max'])
# Multiple columns, multiple stats
df[['purchase_amount', 'rating', 'delivery_days']].agg({
'purchase_amount': ['mean', 'median', 'std'],
'rating': ['mean', 'min', 'max'],
'delivery_days': ['mean', 'median']
})
The All-in-One Command: .describe()
# Numeric columns only (default)
df.describe()
# All columns including categoricals
df.describe(include='all')
# Add custom percentiles
df.describe(percentiles=[0.05, 0.25, 0.50, 0.75, 0.95])
Grouped & Conditional Statistics
Overall statistics rarely tell the full story. The most powerful EDA insights come from computing statistics by group. Pandas' .groupby() makes this effortless.
# Mean purchase amount by product category
df.groupby('product_category')['purchase_amount'].mean().sort_values(ascending=False)
# Multiple stats by gender
df.groupby('gender')['purchase_amount'].agg(['mean', 'median', 'std', 'count'])
# Group by multiple columns
df.groupby(['gender', 'product_category'])['purchase_amount'].mean().unstack()
# Correlation matrix — all numeric columns
df.corr(numeric_only=True).round(2)
When the mean is substantially higher than the median, data is right-skewed. Common in income, purchase amounts, and website traffic. In such cases, the median is a more honest description of the typical value. Use df['col'].skew() to quantify: values above +1 or below −1 indicate significant skew.
Detecting Outliers
An outlier is a value that lies far from the bulk of the data. Outliers can be legitimate extreme values, data entry errors, or measurement errors. EDA must identify them before modelling.
The IQR Method (Robust, Non-parametric)
def detect_outliers_iqr(series):
Q1 = series.quantile(0.25)
Q3 = series.quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR
outliers = series[(series < lower) | (series > upper)]
print(f"Lower fence : {lower:.2f}")
print(f"Upper fence : {upper:.2f}")
print(f"Outliers : {len(outliers)} ({len(outliers)/len(series)*100:.1f}%)")
return outliers
outliers = detect_outliers_iqr(df['purchase_amount'])
The Z-Score Method (Assumes Normality)
from scipy import stats
z_scores = stats.zscore(df['purchase_amount'].dropna())
outliers_z = df[abs(z_scores) > 3]
print(f"Z-score outliers: {len(outliers_z)}")
| Method | Threshold | Assumes Normality | Best For |
|---|---|---|---|
| IQR Method | 1.5 × IQR from Q1/Q3 | No | Skewed data, general use |
| Z-Score | |z| > 3 | Yes | Roughly normal distributions |
| Modified Z-Score | |MAD score| > 3.5 | No | Small datasets, robust |
| Isolation Forest | Anomaly score < threshold | No | Multivariate outliers |
Complete EDA Template — Copy & Use
Paste this function into your notebook and replace df with your dataset. It runs a full numerical EDA in one call.
import pandas as pd
import numpy as np
def run_eda(df, dataset_name="Dataset"):
print(f"\n{'='*55}")
print(f" EDA REPORT: {dataset_name}")
print(f"{'='*55}\n")
# 1. SHAPE
print(f"[1] Shape: {df.shape[0]:,} rows × {df.shape[1]} columns\n")
# 2. DATA TYPES
print("[2] Data Types:")
print(df.dtypes.to_string(), "\n")
# 3. MISSING VALUES
null_df = pd.DataFrame({
'Missing': df.isnull().sum(),
'%': (df.isnull().mean() * 100).round(2)
})
null_df = null_df[null_df['Missing'] > 0]
print("[3] Missing Values:")
print(null_df.to_string() if len(null_df) > 0 else "None — clean!", "\n")
# 4. DUPLICATES
print(f"[4] Duplicated rows: {df.duplicated().sum():,}\n")
# 5. SUMMARY STATISTICS
print("[5] Summary Statistics (numeric):")
print(df.describe(percentiles=[.05,.25,.50,.75,.95]).round(2).to_string(), "\n")
# 6. CATEGORICAL SUMMARY
cat_cols = df.select_dtypes(include=['object','category','bool']).columns
if len(cat_cols) > 0:
print("[6] Categorical Columns:")
for col in cat_cols:
top = df[col].value_counts().head(3)
print(f" {col} ({df[col].nunique()} unique): {dict(top)}")
print()
# 7. SKEWNESS
num_cols = df.select_dtypes(include='number').columns
skew = df[num_cols].skew().sort_values(key=abs, ascending=False)
print("[7] Skewness (|skew| > 1 = significantly skewed):")
print(skew[abs(skew) > 0.5].round(2).to_string(), "\n")
print("EDA complete.")
# Run it
run_eda(df, dataset_name="Sales Dataset 2024")
Once numerical EDA is complete, proceed to: (1) visual EDA — histograms, boxplots, and correlation heatmaps; (2) feature engineering — create new columns, bin continuous variables, encode categoricals; (3) data cleaning — handle nulls, fix types, remove duplicates.
Golden Rules of EDA
.info() and .describe() immediately after loading any dataset — before writing a single line of transformation code.object is one of the most common bugs in data pipelines. Always validate with .dtypes..isnull().sum() before computing any aggregate. A single NaN propagates through calculations silently unless handled..sample(n) in addition to .head(). The first rows are rarely representative of the full dataset and can create a false sense of data quality.