
Data Collection in Data Science

A comprehensive guide to gathering raw data — where to find it, how to access it, and the key difference between structured and unstructured formats.

Section 01

What Is Data Collection?

Data collection is the systematic process of gathering raw information from various sources so it can be stored, processed, and analysed. It is the very first — and arguably most critical — stage of any data science pipeline. No matter how powerful your model or how elegant your visualisation, the quality of your insights is bounded entirely by the quality of the data you collected.

⚠️
Garbage In, Garbage Out

A model trained on poorly collected or biased data will produce poor, biased predictions — even if every other step in the pipeline is done perfectly. Data collection decisions made at the start echo through the entire project.

Data can be collected from dozens of different sources using a variety of methods. The five most common in data science are: databases, APIs, flat files, web scraping, and curated dataset platforms like Kaggle.


Section 02

Sources of Data

1. Databases

Databases are the most common data source in enterprise and production environments. Relational databases organise data in tables and are queried using SQL (Structured Query Language); non-relational databases use other data models and their own query interfaces. Two major categories exist:

Relational (SQL)
Example query: SELECT * FROM orders
Tables with rows and columns. Examples: PostgreSQL, MySQL, SQLite, Microsoft SQL Server.

Non-Relational (NoSQL)
Example query (MongoDB): db.orders.find({})
Documents, key-value pairs, wide-column stores, graphs. Examples: MongoDB, Redis, Cassandra, DynamoDB.

🧮 Querying a SQL Database — Example
Step 1
Connect to the database using a library such as psycopg2 (PostgreSQL) or sqlite3 (SQLite) in Python.
Step 2
Write a query: SELECT customer_id, revenue FROM sales WHERE year = 2024
Step 3
Load results directly into a pandas DataFrame with pd.read_sql(query, conn). Ready for analysis.
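
The three steps above can be sketched with Python's built-in sqlite3 module. The in-memory database and the rows in the `sales` table are made up for illustration; only the shape of the query comes from the text. With pandas installed, `pd.read_sql(query, conn, params=(2024,))` should accept the same query and connection and return a DataFrame instead of a list of tuples.

```python
import sqlite3

# Step 1: connect. ":memory:" creates a throwaway database for this sketch;
# a real project would point at a file or a database server instead.
conn = sqlite3.connect(":memory:")

# Illustrative table and rows (not from any real dataset).
conn.execute("CREATE TABLE sales (customer_id INTEGER, revenue REAL, year INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [(1, 120.0, 2023), (2, 340.5, 2024), (3, 99.9, 2024)],
)

# Step 2: write the query. The "?" placeholder keeps the year out of the SQL
# string itself, which avoids SQL injection when values come from users.
query = "SELECT customer_id, revenue FROM sales WHERE year = ? ORDER BY customer_id"

# Step 3: run it and collect the matching rows.
rows = conn.execute(query, (2024,)).fetchall()
conn.close()

for customer_id, revenue in rows:
    print(customer_id, revenue)
```

Selecting only the columns and rows you need in the query, rather than pulling the whole table, keeps the transfer small when the table is large.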
💡
Real-World Context

Most companies store transactional data (sales, user activity, logs) in relational databases. Data scientists query these to extract subsets for training models or building dashboards.


2. APIs (Application Programming Interfaces)

An API allows your code to request data from an external service over the internet. The service returns data — typically in JSON or XML format — which you then parse and store locally.

| API | Example Data Provided | Auth Required | Free Tier |
|---|---|---|---|
| Twitter / X API | Tweets, user data, trends | Yes | Limited |
| OpenWeatherMap | Temperature, humidity, forecasts | Yes | Yes |
| Alpha Vantage | Stock prices, financials | Yes | Yes |
| NASA Open APIs | Astronomical, Earth imagery | Yes | Yes |
| REST Countries | Country info, flags, geography | No | Yes |

🧮 Fetching Data from an API — Python Example
Step 1
Make an HTTP GET request: requests.get(url, params={'apikey': KEY})
Step 2
Check the status code. 200 = success. 4xx and 5xx codes signal errors (401 = auth failed, 429 = rate limited).
Step 3
Parse the response: data = response.json(), which converts the JSON body into Python objects (usually a dictionary or a list).
Step 4
Flatten nested JSON into a DataFrame: pd.json_normalize(data['results'])
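
Steps 2 to 4 can be sketched without touching the network. In the block below, `rows_from_response` and `flatten` are hypothetical helpers, and the payload is invented; with the requests library, `response.status_code` and `response.text` would supply the two arguments. The `flatten` helper is a simplified stand-in for `pd.json_normalize`, not a replacement for it.

```python
import json

def flatten(record, parent_key="", sep="."):
    """Flatten nested dicts into one level, joining keys with `sep`."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat

def rows_from_response(status_code, body):
    """Check the status, parse the JSON body, and flatten each result."""
    if status_code == 401:
        raise PermissionError("authentication failed (401)")
    if status_code == 429:
        raise RuntimeError("rate limited (429): wait before retrying")
    if status_code != 200:
        raise RuntimeError(f"unexpected status {status_code}")
    data = json.loads(body)
    return [flatten(item) for item in data["results"]]

# Hypothetical payload standing in for the body of a real API response.
sample_body = json.dumps({
    "results": [
        {"station": "A", "main": {"temp": 31.5, "humidity": 40}},
        {"station": "B", "main": {"temp": 38.0, "humidity": 22}},
    ]
})

rows = rows_from_response(200, sample_body)
print(rows[0])
```

Each flattened row has the same keys (`station`, `main.temp`, `main.humidity`), so the list maps directly onto table columns.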
📐
Pro Tip — Respect Rate Limits

Most APIs limit how many requests you can make per minute or per day. Use time.sleep() between requests and implement exponential backoff when you hit a 429 error. Always read the API's terms of service before collecting data at scale.
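
A minimal sketch of exponential backoff follows. The `RateLimited` exception, the `fetch_with_backoff` helper, and the fake fetch function are all hypothetical names introduced for illustration; a real client would raise or detect the 429 from its HTTP library. The `sleep` parameter is injectable so the demo can record the delays instead of actually waiting.

```python
import time

class RateLimited(Exception):
    """Raised by the hypothetical fetch function on an HTTP 429."""

def fetch_with_backoff(fetch, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch(); on RateLimited, wait base_delay * 2**attempt and retry."""
    for attempt in range(max_retries):
        try:
            return fetch()
        except RateLimited:
            if attempt == max_retries - 1:
                raise  # out of retries: let the caller see the error
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...

# Demo with a fake fetch that is rate limited twice, then succeeds.
calls = {"count": 0}

def fake_fetch():
    calls["count"] += 1
    if calls["count"] <= 2:
        raise RateLimited()
    return {"ok": True}

waits = []
# Record the delays in a list instead of sleeping, so the demo runs instantly.
result = fetch_with_backoff(fake_fetch, sleep=waits.append)
print(result, waits)
```

Production clients often add random jitter to each delay and honour a `Retry-After` response header when the server sends one; both are left out here to keep the sketch short.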


3. Flat Files (CSV, Excel, JSON, Parquet)

Flat files are one of the simplest and most universal data formats. Data scientists frequently receive data dumps from clients, government portals, or research institutions in these formats.

| File Format | Best For | Python Library | Notes |
|---|---|---|---|
| .csv | Tabular data, universal exchange | pandas | Human-readable; large files are slow |
| .xlsx | Business reports with formatting | openpyxl | Supports multiple sheets |
| .json | Nested/hierarchical data, APIs | json / pandas | Flexible schema, verbose |
| .parquet | Big data, columnar storage | pyarrow | Compressed, fast reads, ideal for ML |
| .xml | Legacy systems, government data | lxml / xml.etree | Verbose, tree-based structure |

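
To make the CSV and JSON rows of the table concrete, here is a small sketch using only the standard library. The file contents are invented and held in memory; pandas would normally do this in one call, but the manual version shows what differs between the two formats.

```python
import csv
import io
import json

# In-memory stand-ins for files on disk. With a real CSV file, pass
# open(path, newline="", encoding="utf-8") instead of io.StringIO(...).
csv_text = "customer_id,revenue\n1001,4500\n1002,12200\n"
json_text = '{"orders": [{"id": 1, "items": ["pen", "ink"]}]}'

# CSV: every value arrives as a string, so numeric columns need casting.
csv_rows = [
    {"customer_id": int(r["customer_id"]), "revenue": float(r["revenue"])}
    for r in csv.DictReader(io.StringIO(csv_text))
]

# JSON: nesting and basic types (numbers, lists) are preserved by the parser.
orders = json.loads(json_text)["orders"]

print(csv_rows)
print(orders[0]["items"])
```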
💡
When to Use Parquet

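For large datasets (a common rule of thumb is anything over about 100 MB), consider switching from CSV to Parquet. Parquet is columnar, so a reader can load only the columns it needs, which often makes reads many times faster than parsing the equivalent CSV. Its built-in compression also tends to shrink files substantially. The exact gains depend on the data and the query. Parquet is a standard format in big data pipelines (Spark, Databricks, AWS Glue).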


4. Web Scraping

Web scraping involves programmatically extracting data directly from websites when no API exists. You download the raw HTML of a webpage and parse it to extract the information you need.

🧮 Web Scraping Workflow
Step 1
Fetch the page HTML using requests.get(url) or a headless browser like Playwright / Selenium for JavaScript-rendered pages.
Step 2
Parse the HTML with BeautifulSoup: soup = BeautifulSoup(html, 'html.parser')
Step 3
Locate elements using CSS selectors or tags: soup.select('table.data-table tr')
Step 4
Extract, clean, and store the data. Handle missing values and encoding issues. Save as .csv or directly to a database.
Step 5
Schedule automated runs with cron (Linux) or APScheduler (Python) to keep data fresh.
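
The parsing and cleaning steps (2 to 4) can be sketched on a small HTML string. This example swaps BeautifulSoup for the standard library's `html.parser` so that it runs without extra installs; `TableRowParser` is a hypothetical helper, and the page fragment and its values are invented.

```python
from html.parser import HTMLParser

class TableRowParser(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text chunks of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

# Hypothetical page fragment; a real run would use the HTML fetched in Step 1.
html = """
<table class="data-table">
  <tr><th>City</th><th>Population</th></tr>
  <tr><td>Northville</td><td> 12,500 </td></tr>
  <tr><td>Eastport</td><td></td></tr>
</table>
"""

parser = TableRowParser()
parser.feed(html)
header, *records = parser.rows

# Step 4: clean the values and represent missing cells as None.
cleaned = [
    {"city": city, "population": int(pop.replace(",", "")) if pop else None}
    for city, pop in records
]
print(cleaned)
```

With BeautifulSoup, the parser class would collapse to a `soup.select('table.data-table tr')` loop; the cleaning step stays the same either way.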
⚠️
Legal & Ethical Considerations

Always check a website's robots.txt file before scraping. Scraping personal data, copyrighted content, or violating a site's Terms of Service can lead to legal consequences. Prefer official APIs whenever they exist.
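
Checking robots.txt can be automated with the standard library's `urllib.robotparser`. The rules, the URLs, and the user-agent name below are hypothetical; the sketch parses the rules from a list of lines so it runs offline.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content. Against a live site you would instead call
# parser.set_url("https://example.com/robots.txt") followed by parser.read().
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
]

parser = RobotFileParser()
parser.parse(robots_lines)

for url in ("https://example.com/products", "https://example.com/private/users"):
    allowed = parser.can_fetch("my-research-bot", url)
    print(url, "->", "allowed" if allowed else "disallowed")
```

A passing robots.txt check only covers the site's crawling rules; the Terms of Service and data protection law still apply separately.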


5. Kaggle & Open Dataset Platforms

For learning, prototyping, and competitions, curated dataset platforms provide ready-to-use, labelled data. Kaggle is the largest, but several other platforms serve specific domains.

| Platform | Domain Focus | Notable Feature |
|---|---|---|
| Kaggle | General ML, competitions | Very large public dataset catalogue, free GPU notebooks |
| UCI ML Repository | Academic, classic ML datasets | Benchmark datasets (Iris, Wine, Adult, etc.) |
| Hugging Face | NLP, images, audio | One-line dataset loading via datasets library |
| Google Dataset Search | Multi-domain | Search engine for publicly available datasets |
| data.gov / data.gov.in | Government data | Official public sector datasets |

🧮 Downloading a Kaggle Dataset via API
Step 1
Install the Kaggle CLI: pip install kaggle and place your kaggle.json API key in ~/.kaggle/
Step 2
Download: kaggle datasets download -d username/dataset-name
Step 3
Unzip and load: pd.read_csv('data.csv') — you're ready to explore.

Section 03

Structured vs Unstructured Data

Not all data looks the same. One of the most important distinctions in data science is whether data is structured (organised under a predefined schema) or unstructured (raw, without a predefined format). A third category, semi-structured, sits in between.

Structured
Tables
  • Rows and columns with fixed schema
  • Stored in relational databases
  • Directly queryable with SQL
  • Easy to analyse with pandas
  • Often estimated at roughly 20% of enterprise data
Semi-Structured
JSON / XML
  • Has some organisational properties
  • No rigid fixed schema
  • Tags or keys separate data fields
  • Examples: JSON, XML, HTML, logs
  • Parseable but requires transformation
Unstructured
Raw Media
  • No predefined model or schema
  • Stored in data lakes (S3, HDFS)
  • Requires NLP, CV, or deep learning
  • Examples: images, audio, video, text
  • Often estimated at roughly 80% of data generated today

Structured Data — Deep Dive

Structured data lives in a schema: every record has the same fields, the same data types, and the same constraints. This makes it easy to filter, aggregate, and join across tables.

| customer_id | name | age | purchase_amount | date |
|---|---|---|---|---|
| 1001 | Priya Sharma | 28 | ₹4,500 | 2024-03-12 |
| 1002 | Arjun Mehta | 35 | ₹12,200 | 2024-03-13 |
| 1003 | Deepa Nair | 42 | ₹8,750 | 2024-03-14 |

Each row is a record. Each column has a fixed type (integer, string, date). You can immediately run SELECT AVG(purchase_amount) FROM customers WHERE age > 30 — no preprocessing needed.
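
As a quick check, the query can be run against the three sample rows using an in-memory SQLite database. This is a sketch under one assumption: `purchase_amount` is stored as a plain number of rupees, not as a formatted string such as "₹4,500".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers ("
    "customer_id INTEGER, name TEXT, age INTEGER, purchase_amount INTEGER, date TEXT)"
)
# The three sample rows; amounts are stored as numbers, without the
# currency symbol or thousands separator.
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?, ?, ?)",
    [
        (1001, "Priya Sharma", 28, 4500, "2024-03-12"),
        (1002, "Arjun Mehta", 35, 12200, "2024-03-13"),
        (1003, "Deepa Nair", 42, 8750, "2024-03-14"),
    ],
)

# Only the two customers older than 30 contribute to the average.
(average,) = conn.execute(
    "SELECT AVG(purchase_amount) FROM customers WHERE age > 30"
).fetchone()
conn.close()
print(average)  # 10475.0
```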

When to Use Structured Data

Structured data is ideal for regression, classification, time-series forecasting, and business intelligence. If your problem involves numeric or categorical features in a table, structured data is your friend.


Unstructured Data — Deep Dive

Unstructured data has no predefined format. A photograph, a customer review, a recorded call, or a PDF report — these cannot be loaded directly into a spreadsheet. They require specialised techniques to extract meaning.

| Data Type | Example | Technique Used | Python Library |
|---|---|---|---|
| Text | Product reviews, news articles | NLP, tokenisation, embeddings | spaCy, NLTK, Transformers |
| Image | X-rays, product photos, CCTV | Convolutional Neural Networks (CNNs) | OpenCV, Torchvision, PIL |
| Audio | Calls, music, speech | Spectrograms, speech-to-text | librosa, SpeechRecognition |
| Video | Surveillance, social media | Frame extraction + CNNs | OpenCV, ffmpeg |
| PDF / Documents | Contracts, reports | OCR, text extraction | pdfplumber, PyMuPDF |

🎯
Bridging the Gap — Feature Engineering

Unstructured data must be transformed into structured form before most ML algorithms can use it. For text: word counts, TF-IDF vectors, or sentence embeddings. For images: pixel arrays or CNN feature maps. This transformation step is called feature extraction.
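
To show what turning text into structured form looks like, here is a minimal TF-IDF sketch in plain Python. It uses one simple variant (term frequency as count divided by document length, inverse document frequency as log(N / document frequency)); libraries such as scikit-learn use smoothed variants, so their numbers will differ. The `tfidf` function and the reviews are invented for illustration.

```python
import math
from collections import Counter

def tfidf(documents):
    """Return one {term: weight} dict per document.

    Variant used here: tf = count / document length,
    idf = log(N / number of documents containing the term).
    """
    tokenised = [doc.lower().split() for doc in documents]
    n_docs = len(tokenised)
    # Count each term once per document it appears in.
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))

    vectors = []
    for tokens in tokenised:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors

# Hypothetical product reviews.
reviews = [
    "great phone great battery",
    "poor battery",
    "great screen",
]
vectors = tfidf(reviews)
print(vectors[1])
```

In the second review, "poor" receives a higher weight than "battery" because it appears in only one review, which is the property that makes TF-IDF useful for picking out distinctive terms.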


Section 04

Structured vs Unstructured — Side-by-Side

| Property | Structured | Semi-Structured | Unstructured |
|---|---|---|---|
| Schema | Fixed | Flexible | None |
| Storage | RDBMS (SQL) | NoSQL / JSON stores | Data lakes, object storage |
| Query with SQL | Yes | Partially | No |
| ML-ready out of the box | Yes | Needs parsing | No |
| Volume in the real world | Smaller share (often cited as ~20%) | Small share | Largest share (often cited as ~80%) |
| Example | Sales database | API JSON response | Customer reviews, images |
| Typical tools | pandas, SQL | json, lxml, MongoDB | spaCy, OpenCV, Transformers |

Section 05

Golden Rules of Data Collection

🎯 5 Rules Every Data Scientist Should Follow
1
Always check the source's reliability and recency. A dataset that is 5 years old may no longer reflect current patterns — especially in fast-moving domains like finance or social media.
2
Document exactly how you collected the data: timestamp, source URL, API version, query parameters. Reproducibility is essential — you must be able to re-collect the same data 6 months later.
3
Check for sampling bias. If you collect data only from one region, one demographic, or one time period, your model will not generalise to the full population.
4
Secure personal and sensitive data immediately. Apply anonymisation, encryption, and access controls before storing. Compliance with GDPR, DPDP (India), or HIPAA may be legally required.
5
Collect more data than you think you need. In practice, a sizeable share of raw data is often discarded during cleaning (duplicates, missing values, invalid records). Starting with a larger, richer dataset gives you more flexibility downstream.
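
Rule 2 can be made routine by writing a small provenance record next to every raw file. The sketch below is one possible shape, not a standard: the `CollectionRecord` fields, the `describe` helper, the URL, and the parameters are all hypothetical. The SHA-256 checksum lets you detect later whether a re-collected file differs from the original.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone

@dataclass
class CollectionRecord:
    """Metadata stored next to a raw data file (field names are illustrative)."""
    source_url: str
    api_version: str
    query_params: dict
    sha256: str
    # Timestamp in UTC, filled in when the record is created.
    collected_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

def describe(raw_bytes, source_url, api_version, query_params):
    """Build a record whose checksum identifies the exact bytes collected."""
    digest = hashlib.sha256(raw_bytes).hexdigest()
    return CollectionRecord(source_url, api_version, query_params, digest)

# Hypothetical download; the URL and parameters are placeholders.
raw = b"customer_id,revenue\n1001,4500\n"
record = describe(
    raw,
    source_url="https://api.example.com/v2/sales",
    api_version="v2",
    query_params={"year": 2024},
)
print(json.dumps(asdict(record), indent=2))
```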
🧮
Key Takeaway

Data collection is not a one-time event — it is an ongoing, deliberate process. The best data scientists treat data collection with the same rigour they apply to modelling: questioning assumptions, validating sources, and continuously monitoring for drift or degradation in data quality.