Data Collection in Data Science

Section 01

What Is Data Collection?

Data collection is the systematic process of gathering raw information from various sources so it can be stored, processed, and analysed. It is the very first — and arguably most critical — stage of any data science pipeline. No matter how powerful your model or how elegant your visualisation, the quality of your insights is bounded entirely by the quality of the data you collected.

⚠️ Garbage In, Garbage Out

A model trained on poorly collected or biased data will produce poor, biased predictions — even if every other step in the pipeline is done perfectly. Data collection decisions made at the start echo through the entire project.

Data can be collected from dozens of different sources using a variety of methods. The five most common in data science are: databases, APIs, flat files, web scraping, and curated dataset platforms like Kaggle.

Section 02

Sources of Data

1. Databases

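Databases are the most common data source in enterprise and production environments. Relational databases organise data in structured tables that are queried with SQL (Structured Query Language); non-relational databases use other data models and their own query interfaces. The two major categories are: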

Relational (SQL)

SELECT * FROM orders

Tables with rows and columns. Examples: PostgreSQL, MySQL, SQLite, Microsoft SQL Server.

Non-Relational (NoSQL)

db.orders.find({})

Documents, key-value pairs, graphs. Examples: MongoDB, Redis, Cassandra, DynamoDB.

🧮 Querying a SQL Database — Example

Step 1

Connect to the database using a library such as psycopg2 (PostgreSQL) or sqlite3 (SQLite) in Python.

Step 2

Write a query: SELECT customer_id, revenue FROM sales WHERE year = 2024

Step 3

Load results directly into a pandas DataFrame with pd.read_sql(query, conn). Ready for analysis.
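The three steps above can be sketched with the standard-library sqlite3 module. This is a minimal illustration, not a production setup: the in-memory database and its three sample rows are assumptions made so the example runs on its own, and the year is passed as a query parameter rather than written into the SQL string.

```python
import sqlite3

# Step 1: connect. An in-memory database stands in for a real sales
# database here (hypothetical sample data, for illustration only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (customer_id INTEGER, revenue REAL, year INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [(1001, 4500.0, 2024), (1002, 12200.0, 2024), (1003, 8750.0, 2023)],
)

# Step 2: write the query. The "?" placeholder keeps the value out of
# the SQL text, which avoids quoting bugs and SQL injection.
query = (
    "SELECT customer_id, revenue FROM sales "
    "WHERE year = ? ORDER BY customer_id"
)

# Step 3: fetch the results as a list of tuples.
rows = conn.execute(query, (2024,)).fetchall()
print(rows)

conn.close()
```

With pandas installed, calling pd.read_sql(query, conn, params=(2024,)) before the connection is closed should return the same rows as a DataFrame.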

💡 Real-World Context

Most companies store transactional data (sales, user activity, logs) in relational databases. Data scientists query these to extract subsets for training models or building dashboards.

2. APIs (Application Programming Interfaces)

An API allows your code to request data from an external service over the internet. The service returns data — typically in JSON or XML format — which you then parse and store locally.

API Example	Data Provided	Auth Required	Free Tier
Twitter / X API	Tweets, user data, trends	Yes	Limited
OpenWeatherMap	Temperature, humidity, forecasts	Yes	Yes
Alpha Vantage	Stock prices, financials	Yes	Yes
NASA Open APIs	Astronomical, Earth imagery	Yes	Yes
REST Countries	Country info, flags, geography	No	Yes

🧮 Fetching Data from an API — Python Example

Step 1

Make an HTTP GET request: requests.get(url, params={'apikey': KEY})

Step 2

Check the status code. 200 = success (other 2xx codes also indicate success). 4xx and 5xx codes signal an error (401 = auth failed, 429 = rate limited).

Step 3

Parse the response: data = response.json() — converts the JSON string into a Python dictionary.

Step 4

Flatten nested JSON into a DataFrame: pd.json_normalize(data['results'])
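Steps 2 to 4 can be sketched without any network access. The response body, the status code, and the flatten helper below are all assumptions made for illustration; the helper only handles nested dictionaries and imitates the dotted column names that pd.json_normalize produces by default.

```python
import json


def flatten(record, parent_key="", sep="."):
    """Flatten nested dicts into one level, joining keys with `sep`."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


# Hypothetical response; in real code these come from response.status_code
# and response.text after requests.get(...).
status_code = 200
body = """
{"results": [
  {"city": "Pune",  "main": {"temp": 31.5, "humidity": 40}},
  {"city": "Kochi", "main": {"temp": 29.0, "humidity": 78}}
]}
"""

if status_code == 200:
    data = json.loads(body)  # JSON string -> Python dict
    rows = [flatten(item) for item in data["results"]]
else:
    rows = []

print(rows[0])
```

Each flattened row is a plain dictionary, so a list of them can be passed straight to pd.DataFrame(rows) if pandas is available.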

📐 Pro Tip: Respect Rate Limits

Most APIs limit how many requests you can make per minute or per day. Use time.sleep() between requests and implement exponential backoff when you hit a 429 error. Always read the API's terms of service before collecting data at scale.
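Exponential backoff can be sketched as a small retry loop. In this illustration, fetch_with_backoff and its parameters are hypothetical names, and send_request is an assumed callable returning (status_code, body); in real code it would wrap requests.get(...). The sleep function is injectable so the demo does not actually wait.

```python
import time


def fetch_with_backoff(send_request, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Retry send_request() while it returns 429, doubling the wait each time."""
    for attempt in range(max_retries + 1):
        status, body = send_request()
        if status != 429:
            return status, body
        if attempt < max_retries:
            # Waits base_delay, then 2x, then 4x, ... before the next try.
            sleep(base_delay * 2 ** attempt)
    # Retries exhausted: return the last (still rate-limited) response.
    return status, body


# Demo: a fake endpoint that is rate limited twice, then succeeds.
responses = iter([(429, ""), (429, ""), (200, '{"ok": true}')])
waits = []  # records the delays instead of really sleeping
status, body = fetch_with_backoff(lambda: next(responses), sleep=waits.append)
print(status, waits)
```

Production code often adds random jitter to the delay and honours a Retry-After response header when the API sends one; both are left out here to keep the sketch short.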

3. Flat Files (CSV, Excel, JSON, Parquet)

Flat files are one of the simplest and most universal data formats. Data scientists frequently receive data dumps from clients, government portals, or research institutions in these formats.

File Format	Best For	Python Library	Notes
.csv	Tabular data, universal exchange	pandas	Human-readable; large files are slow
.xlsx	Business reports with formatting	openpyxl	Supports multiple sheets
.json	Nested/hierarchical data, APIs	json / pandas	Flexible schema, verbose
.parquet	Big data, columnar storage	pyarrow	Compressed, fast reads, ideal for ML
.xml	Legacy systems, government data	lxml / xml.etree	Verbose, tree-based structure
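One practical difference between these formats is how much type information they carry. The sketch below uses only the standard library and an inline CSV string (assumed sample data) to show that CSV values arrive as strings and need explicit conversion, while JSON keeps numbers and strings distinct across a round trip.

```python
import csv
import io
import json

# Inline CSV text stands in for a file on disk (hypothetical sample data).
csv_text = "customer_id,name,age\n1001,Priya Sharma,28\n1002,Arjun Mehta,35\n"

# csv.DictReader yields one dict per row; every value is a string,
# so numeric columns have to be converted by hand.
records = []
for row in csv.DictReader(io.StringIO(csv_text)):
    row["customer_id"] = int(row["customer_id"])
    row["age"] = int(row["age"])
    records.append(row)

# JSON distinguishes numbers from strings, so the types survive a round trip.
json_text = json.dumps(records)
restored = json.loads(json_text)

print(records[0])
print(restored == records)
```

pandas.read_csv infers column types automatically, which is one reason it is usually preferred over the csv module for analysis work.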

💡 When to Use Parquet

For datasets over 100 MB, consider switching from CSV to Parquet. Because Parquet is columnar, a reader loads only the columns it needs; reads are often 10–100× faster on wide tables, and built-in compression can cut file sizes by up to 80%, though actual gains depend on the data and the query. It is a standard format in big data pipelines (Spark, Databricks, AWS Glue).

4. Web Scraping

Web scraping involves programmatically extracting data directly from websites when no API exists. You download the raw HTML of a webpage and parse it to extract the information you need.

🧮 Web Scraping Workflow

Step 1

Fetch the page HTML using requests.get(url) or a headless browser like Playwright / Selenium for JavaScript-rendered pages.

Step 2

Parse the HTML with BeautifulSoup: soup = BeautifulSoup(html, 'html.parser')

Step 3

Locate elements using CSS selectors or tags: soup.select('table.data-table tr')

Step 4

Extract, clean, and store the data. Handle missing values and encoding issues. Save as .csv or directly to a database.

Step 5

Schedule automated runs with cron (Linux) or APScheduler (Python) to keep data fresh.
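Steps 2 to 4 of the workflow can be sketched offline. To stay dependency-free, this example swaps BeautifulSoup for the standard-library html.parser module, and it parses an inline HTML string instead of fetching a page; the TableParser class and the sample table are assumptions made for illustration.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Inline HTML stands in for the body returned by requests.get(url).text.
html = """
<table class="data-table">
  <tr><th>City</th><th>Temp</th></tr>
  <tr><td>Pune</td><td> 31.5 </td></tr>
  <tr><td>Kochi</td><td></td></tr>
</table>
"""

parser = TableParser()
parser.feed(html)

header, *body = parser.rows
# Clean-up step: treat empty cells as missing values (None).
records = [dict(zip(header, (cell or None for cell in row))) for row in body]
print(records)
```

With BeautifulSoup installed, the selector from Step 3, soup.select('table.data-table tr'), would locate the same rows with far less code.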

⚠️ Legal & Ethical Considerations

Always check a website's robots.txt file before scraping. Scraping personal data, copyrighted content, or violating a site's Terms of Service can lead to legal consequences. Prefer official APIs whenever they exist.
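The robots.txt check can be automated with the standard-library urllib.robotparser module. In this sketch the rules are supplied as an inline string so it runs offline; the user agent name and URLs are hypothetical. Against a live site you would call set_url(...) and read() instead of parse(...).

```python
from urllib.robotparser import RobotFileParser

# Inline rules stand in for https://example.com/robots.txt (assumed content).
rules = """\
User-agent: *
Disallow: /private/
Allow: /
""".splitlines()

robots = RobotFileParser()
robots.parse(rules)

# can_fetch(user_agent, url) reports whether the rules permit the request.
blocked = robots.can_fetch("my-research-bot", "https://example.com/private/data.html")
allowed = robots.can_fetch("my-research-bot", "https://example.com/public/page.html")
print(blocked, allowed)
```

A permissive robots.txt does not override a site's Terms of Service or data protection law, so this check is a first filter rather than legal clearance.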

5. Kaggle & Open Dataset Platforms

For learning, prototyping, and competitions, curated dataset platforms provide ready-to-use, often labelled data. Kaggle is among the largest and best known, but several other platforms serve specific domains.

Platform	Domain Focus	Notable Feature
Kaggle	General ML, competitions	1M+ datasets, free GPU notebooks
UCI ML Repository	Academic, classic ML datasets	Benchmark datasets (Iris, Wine, Adult, etc.)
Hugging Face	NLP, images, audio	One-line dataset loading via datasets library
Google Dataset Search	Multi-domain	Search engine for publicly available datasets
data.gov / data.gov.in	Government data	Official public sector datasets

🧮 Downloading a Kaggle Dataset via API

Step 1

Install the Kaggle CLI: pip install kaggle and place your kaggle.json API key in ~/.kaggle/

Step 2

Download: kaggle datasets download -d username/dataset-name

Step 3

Unzip and load: pd.read_csv('data.csv') — you're ready to explore.
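Step 3 can also be done from Python rather than the shell. The sketch below builds a tiny archive locally as a stand-in for the downloaded file, so the archive name, the data.csv member, and its contents are all assumptions; only the extract-and-read part mirrors what you would do with a real download.

```python
import csv
import tempfile
import zipfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    archive = Path(tmp) / "dataset-name.zip"

    # Stand-in for the download step: create a small archive locally.
    with zipfile.ZipFile(archive, "w") as zf:
        zf.writestr("data.csv", "id,label\n1,cat\n2,dog\n")

    # Unzip next to the archive.
    with zipfile.ZipFile(archive) as zf:
        members = zf.namelist()
        zf.extractall(tmp)

    # Load the extracted CSV (pd.read_csv would work the same way).
    with open(Path(tmp) / "data.csv", newline="") as f:
        rows = list(csv.DictReader(f))

print(members, rows)
```

Listing the archive members first is a cheap way to discover the actual file names, which vary from dataset to dataset.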

Section 03

Structured vs Unstructured Data

Not all data looks the same. One of the most important distinctions in data science is whether data is structured (organised and machine-readable) or unstructured (raw, without a predefined format). A third category — semi-structured — sits in between.

Structured

Tables

Rows and columns with fixed schema
Stored in relational databases
Directly queryable with SQL
Easy to analyse with pandas
~20% of enterprise data (a commonly cited estimate)

Semi-Structured

JSON / XML

Has some organisational properties
No rigid fixed schema
Tags or keys separate data fields
Examples: JSON, XML, HTML, logs
Parseable but requires transformation

Unstructured

Raw Media

No predefined model or schema
Stored in data lakes (S3, HDFS)
Requires NLP, CV, or deep learning
Examples: images, audio, video, text
~80% of data generated today (a commonly cited estimate)

Structured Data — Deep Dive

Structured data lives in a schema: every record has the same fields, the same data types, and the same constraints. This makes it easy to filter, aggregate, and join across tables.

customer_id	name	age	purchase_amount	date
1001	Priya Sharma	28	₹4,500	2024-03-12
1002	Arjun Mehta	35	₹12,200	2024-03-13
1003	Deepa Nair	42	₹8,750	2024-03-14

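Each row is a record. Each column has a fixed type (integer, string, date). Provided purchase_amount is stored as a number (the ₹ symbol and thousands separators above are display formatting), you can immediately run SELECT AVG(purchase_amount) FROM customers WHERE age > 30 with no preprocessing.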
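The query can be tried on the three sample rows with an in-memory SQLite database. This is a sketch under assumptions: the column types below are one plausible schema, and the amounts are stored as plain numbers without the currency symbol.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Assumed schema for the sample table shown in the text.
conn.execute(
    "CREATE TABLE customers ("
    "customer_id INTEGER PRIMARY KEY, name TEXT, age INTEGER, "
    "purchase_amount REAL, date TEXT)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?, ?, ?)",
    [
        (1001, "Priya Sharma", 28, 4500, "2024-03-12"),
        (1002, "Arjun Mehta", 35, 12200, "2024-03-13"),
        (1003, "Deepa Nair", 42, 8750, "2024-03-14"),
    ],
)

# Only the two customers older than 30 contribute to the average.
avg_amount = conn.execute(
    "SELECT AVG(purchase_amount) FROM customers WHERE age > 30"
).fetchone()[0]
conn.close()

print(avg_amount)  # (12200 + 8750) / 2 = 10475.0
```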

✅ When to Use Structured Data

Structured data is ideal for regression, classification, time-series forecasting, and business intelligence. If your problem involves numeric or categorical features in a table, structured data is your friend.

Unstructured Data — Deep Dive

Unstructured data has no predefined format. A photograph, a customer review, a recorded call, or a PDF report — these cannot be loaded directly into a spreadsheet. They require specialised techniques to extract meaning.

Data Type	Example	Technique Used	Python Library
Text	Product reviews, news articles	NLP, tokenisation, embeddings	spaCy, NLTK, Transformers
Image	X-rays, product photos, CCTV	Convolutional Neural Networks (CNNs)	OpenCV, Torchvision, Pillow (PIL)
Audio	Calls, music, speech	Spectrograms, speech-to-text	librosa, SpeechRecognition
Video	Surveillance, social media	Frame extraction + CNNs	OpenCV, ffmpeg
PDF / Documents	Contracts, reports	OCR, text extraction	pdfplumber, PyMuPDF

🎯 Bridging the Gap: Feature Extraction

Unstructured data must be transformed into structured form before most ML algorithms can use it. For text: word counts, TF-IDF vectors, or sentence embeddings. For images: pixel arrays or CNN feature maps. This transformation step is called feature extraction.
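The simplest of the text transformations mentioned above, word counts, can be sketched with the standard library. The two reviews and the tokenise helper are assumptions for illustration; real pipelines usually rely on a library tokeniser and often weight the counts with TF-IDF.

```python
import re
from collections import Counter

# Hypothetical unstructured input: two short product reviews.
reviews = [
    "Great phone, great battery",
    "Battery died fast",
]


def tokenise(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


tokenised = [tokenise(review) for review in reviews]

# The vocabulary fixes the column order of the resulting table.
vocabulary = sorted({token for doc in tokenised for token in doc})

# One row per review, one column per vocabulary word.
vectors = []
for doc in tokenised:
    counts = Counter(doc)
    vectors.append([counts[word] for word in vocabulary])

print(vocabulary)
print(vectors)
```

Each review is now a fixed-length numeric row, which is the structured form most ML algorithms expect.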

Section 04

Structured vs Unstructured — Side-by-Side

Property	Structured	Semi-Structured	Unstructured
Schema	Fixed	Flexible	None
Storage	RDBMS (SQL)	NoSQL / JSON stores	Data lakes, object storage
Query with SQL	Yes	Partially	No
ML-ready out of the box	Yes	Needs parsing	No
Volume in the real world (rough estimates)	~20%	Often counted with unstructured	~80%
Example	Sales database	API JSON response	Customer reviews, images
Typical tools	pandas, SQL	json, lxml, MongoDB	spaCy, OpenCV, Transformers

Section 05

Golden Rules of Data Collection

🎯 5 Rules Every Data Scientist Should Follow

Always check the source's reliability and recency. A dataset that is 5 years old may no longer reflect current patterns — especially in fast-moving domains like finance or social media.

Document exactly how you collected the data: timestamp, source URL, API version, query parameters. Reproducibility is essential: six months later you should be able to re-run the same collection, or at least show exactly how the original data was obtained.

Check for sampling bias. If you collect data only from one region, one demographic, or one time period, your model will not generalise to the full population.

Secure personal and sensitive data immediately. Apply anonymisation, encryption, and access controls before storing. Compliance with GDPR, DPDP (India), or HIPAA may be legally required.

Collect more data than you think you need, within the limits of the privacy rules above. A substantial share of raw data (30–50% is a commonly quoted range) is typically discarded during cleaning, so starting with a larger, richer dataset gives you more flexibility downstream.
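The documentation rule (timestamp, source URL, API version, query parameters) can be made concrete with a small provenance record saved next to each raw file. The function name, field names, and example URL below are hypothetical; the SHA-256 checksum is an extra assumption, added so that a later re-collection can be compared byte for byte.

```python
import hashlib
import json
from datetime import datetime, timezone


def provenance_record(raw_bytes, source_url, api_version, params):
    """Describe how and when a raw payload was collected."""
    return {
        "collected_at": datetime.now(timezone.utc).isoformat(),
        "source_url": source_url,
        "api_version": api_version,
        "query_params": params,
        "sha256": hashlib.sha256(raw_bytes).hexdigest(),
        "n_bytes": len(raw_bytes),
    }


# Hypothetical payload and endpoint, for illustration only.
raw = b'{"results": []}'
record = provenance_record(
    raw,
    source_url="https://api.example.com/v1/sales",
    api_version="v1",
    params={"year": 2024},
)

# Store this JSON alongside the raw data file.
print(json.dumps(record, indent=2))
```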

🧮 Key Takeaway

Data collection is not a one-time event — it is an ongoing, deliberate process. The best data scientists treat data collection with the same rigour they apply to modelling: questioning assumptions, validating sources, and continuously monitoring for drift or degradation in data quality.