Data Preparation / Data Preprocessing
Data Preparation (or Data Preprocessing) is the process of cleaning, transforming, and organizing raw data into a structured format suitable for training Machine Learning models.
It is a crucial step in the Machine Learning pipeline because real-world data is often incomplete, noisy, and inconsistent, and these flaws can significantly degrade model performance if they are not handled before training.
This stage involves multiple steps such as handling missing values, removing duplicates, correcting inconsistencies, and detecting outliers. It also includes transforming data by scaling numerical features (through normalization or standardization) and encoding categorical variables.
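As a rough illustration of the cleaning and transformation steps listed above, here is a minimal pandas sketch. The toy DataFrame, its column names, and the `prepare` helper are all hypothetical, and the specific choices (median imputation, the 1.5 × IQR outlier rule, z-score and min-max scaling, one-hot encoding) are just one reasonable option for each step, not the only way to do it.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one missing age, one duplicated row,
# inconsistent city spelling, and one extreme income value.
raw = pd.DataFrame({
    "age": [25, 32, np.nan, 32, 41, 29],
    "city": ["Pune", "delhi ", "Pune", "delhi ", "Delhi", "Pune"],
    "income": [36000, 42000, 38000, 42000, 40000, 900000],
})

def prepare(df: pd.DataFrame) -> pd.DataFrame:
    """Illustrative helper (not from any library): clean and transform df."""
    out = df.copy()

    # Correct inconsistencies: trim whitespace and unify casing.
    out["city"] = out["city"].str.strip().str.title()

    # Remove exact duplicate rows.
    out = out.drop_duplicates().reset_index(drop=True)

    # Handle missing values: impute with the column median.
    out["age"] = out["age"].fillna(out["age"].median())

    # Detect outliers with the 1.5 * IQR rule (flag them, do not drop them).
    q1, q3 = out["income"].quantile([0.25, 0.75])
    iqr = q3 - q1
    out["income_outlier"] = (out["income"] < q1 - 1.5 * iqr) | (
        out["income"] > q3 + 1.5 * iqr
    )

    # Standardization (z-score): mean 0, standard deviation 1.
    out["age_z"] = (out["age"] - out["age"].mean()) / out["age"].std(ddof=0)

    # Normalization (min-max): rescale to the range [0, 1].
    age_min, age_max = out["age"].min(), out["age"].max()
    out["age_minmax"] = (out["age"] - age_min) / (age_max - age_min)

    # Encode the categorical column as one-hot indicator columns.
    return pd.get_dummies(out, columns=["city"], dtype=int)

clean = prepare(raw)
print(clean)
```

In a real project the imputation and scaling statistics would normally be computed on the training split only and then reused on the test split, so that no information leaks from the test data.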
Beyond cleaning, data preprocessing also focuses on feature engineering (creating new useful features) and feature selection (choosing the most relevant variables). Additionally, techniques like handling imbalanced datasets and splitting data into training and testing sets are essential parts of this phase.
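The ideas in the paragraph above can be sketched in a few lines of pandas. Everything here is an assumption made for illustration: the toy columns, the BMI feature, the "drop constant columns" filter as a stand-in for feature selection, and the 50% test fraction (chosen only because the toy data has ten rows).

```python
import pandas as pd

# Hypothetical toy dataset with an imbalanced binary target (8 zeros, 2 ones).
df = pd.DataFrame({
    "height_m": [1.60, 1.70, 1.80, 1.50, 1.90, 1.65, 1.75, 1.85, 1.55, 1.70],
    "weight_kg": [60, 72, 80, 50, 95, 58, 77, 88, 54, 70],
    "constant": [1] * 10,
    "target": [0, 0, 0, 0, 1, 0, 0, 1, 0, 0],
})

# Feature engineering: derive a new feature from existing ones.
df["bmi"] = df["weight_kg"] / df["height_m"] ** 2

# Feature selection (a very simple filter): drop columns that never vary,
# since a constant column carries no information about the target.
features = df.drop(columns="target")
keep = [col for col in features.columns if features[col].nunique() > 1]

# Imbalance check: share of each class in the target.
class_share = df["target"].value_counts(normalize=True)

# Stratified split: sample the same fraction from every class so that the
# minority class appears in both the training and the test set.
test = df.groupby("target").sample(frac=0.5, random_state=42)
train = df.drop(test.index)

print("kept features:", keep)
print(class_share)
print(len(train), "training rows,", len(test), "test rows")
```

Splitting before any resampling of the minority class keeps the test set representative of the original class distribution.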
A well-executed data preparation process typically improves model accuracy, reduces bias introduced by flawed or unrepresentative data, and helps the Machine Learning model generalize to unseen data. In practice, this step is often estimated to take 60–80% of the total project time, making it one of the most important stages in any data science workflow.
What You'll Learn
Data Collection in Data Science (16 min)
A comprehensive guide to gathering raw data: where to find it, how to access it, and the key diffe…
Exploratory Data Analysis (EDA) with Pandas (31 min)
A hands-on guide to reading datasets, understanding their structure, data types, and computing esse…
Data Visualization & Pattern Detection (32 min)
A comprehensive guide to visualising data using Python's most powerful libraries: matplotlib, seab…
Matplotlib in Python (47 min)
A hands-on guide to Python's most powerful plotting library, covering line charts, bar charts, sca…
Data Cleaning in Python: Handling Missing Values & Removing Duplicates (46 min)
A story-driven, comprehensive guide to cleaning real-world messy data using pandas, covering missi…
Fixing Inconsistent Data & Outlier Detection in Python (45 min)
A story-driven, hands-on guide to identifying and fixing inconsistent data formats, standardising c…
Data Transformation: Normalisation & Standardisation in Python (51 min)
A story-driven, hands-on guide to transforming raw features into model-ready form using normalisati…
Encoding Categorical Variables (49 min)
A story-driven, comprehensive guide to converting categorical text data into numeric form using Lab…
Scaling in Machine Learning (52 min)
A story-driven, hands-on guide to feature scaling techniques: Min-Max Normalisation, Z-Score Stand…
Feature Selection in Machine Learning (64 min)
A story-driven, comprehensive guide to selecting the most relevant features for machine learning mo…
Feature Engineering & Feature Scaling (68 min)
A story-driven, comprehensive guide to creating powerful new features from raw data and scaling the…
Handling Imbalanced Data in Machine Learning (45 min)
A story-driven, comprehensive guide to detecting and treating class imbalance, covering oversampli…
Data Splitting Mastery (66 min)
A comprehensive, story-driven guide to the art and science of splitting datasets fo…