What Is Data Collection?
Data collection is the systematic process of gathering raw information from various sources so it can be stored, processed, and analysed. It is the very first — and arguably most critical — stage of any data science pipeline. No matter how powerful your model or how elegant your visualisation, the quality of your insights is bounded entirely by the quality of the data you collected.
A model trained on poorly collected or biased data will produce poor, biased predictions — even if every other step in the pipeline is done perfectly. Data collection decisions made at the start echo through the entire project.
Data can be collected from dozens of different sources using a variety of methods. The five most common in data science are: databases, APIs, flat files, web scraping, and curated dataset platforms like Kaggle.
Sources of Data
1. Databases
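Databases are the most common data source in enterprise and production environments. Two major categories exist: relational (SQL) databases, which organise data in structured tables and are queried using SQL (Structured Query Language), and non-relational (NoSQL) databases, which store documents, key-value pairs, or other flexible structures.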
Most companies store transactional data (sales, user activity, logs) in relational databases. Data scientists query these to extract subsets for training models or building dashboards.
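As a minimal sketch of extracting a subset from a relational database, the example below uses Python's built-in sqlite3 module with an in-memory database. The sales table, its columns, and the fetch_region_sales helper are hypothetical stand-ins for a real production schema.

```python
import sqlite3

# In-memory database standing in for a production relational store.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (order_id INTEGER PRIMARY KEY, region TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO sales (order_id, region, amount) VALUES (?, ?, ?)",
    [(1, "north", 120.0), (2, "south", 75.5), (3, "north", 210.0), (4, "east", 42.0)],
)
conn.commit()


def fetch_region_sales(connection, region):
    """Return (order_id, amount) rows for one region via a parameterised query."""
    cursor = connection.execute(
        "SELECT order_id, amount FROM sales WHERE region = ? ORDER BY order_id",
        (region,),
    )
    return cursor.fetchall()


north_rows = fetch_region_sales(conn, "north")
print(north_rows)  # [(1, 120.0), (3, 210.0)]
```

Passing the region as a ? parameter rather than formatting it into the SQL string avoids SQL injection and is worth keeping even in exploratory notebooks.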
2. APIs (Application Programming Interfaces)
An API allows your code to request data from an external service over the internet. The service returns data — typically in JSON or XML format — which you then parse and store locally.
| API Example | Data Provided | Auth Required | Free Tier |
|---|---|---|---|
| Twitter / X API | Tweets, user data, trends | Yes | Limited |
| OpenWeatherMap | Temperature, humidity, forecasts | Yes | Yes |
| Alpha Vantage | Stock prices, financials | Yes | Yes |
| NASA Open APIs | Astronomical, Earth imagery | Yes | Yes |
| REST Countries | Country info, flags, geography | No | Yes |
Most APIs limit how many requests you can make per minute or per day. Space out calls with time.sleep() and implement exponential backoff when you receive an HTTP 429 (Too Many Requests) response. Always read the API's terms of service before collecting data at scale.
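The backoff advice above can be sketched as follows. This is an illustration under stated assumptions, not a client for any particular API: fetch is a hypothetical zero-argument callable returning a (status_code, body) pair, and the fake responses stand in for a real HTTP call so the example runs offline.

```python
import time


def fetch_with_backoff(fetch, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it stops returning HTTP 429, doubling the wait each time.

    fetch is a hypothetical callable returning (status_code, body).
    """
    for attempt in range(max_retries + 1):
        status, body = fetch()
        if status != 429:
            return status, body
        if attempt < max_retries:
            sleep(base_delay * (2 ** attempt))  # waits 1s, 2s, 4s, ...
    raise RuntimeError(f"still rate-limited after {max_retries} retries")


# Simulated service: rate-limited twice, then succeeds.
responses = iter([(429, ""), (429, ""), (200, '{"temp": 21.5}')])
waits = []  # record the delays instead of actually sleeping
status, body = fetch_with_backoff(lambda: next(responses), sleep=waits.append)
print(status, waits)  # 200 [1.0, 2.0]
```

In real use, fetch would wrap an HTTP request. Many services also send a Retry-After header with a 429 response, and honouring it, or adding random jitter to the delay, is usually kinder to the server than a fixed schedule.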
3. Flat Files (CSV, Excel, JSON, Parquet)
Flat files are one of the simplest and most universal data formats. Data scientists frequently receive data dumps from clients, government portals, or research institutions in these formats.
| File Format | Best For | Python Library | Notes |
|---|---|---|---|
| .csv | Tabular data, universal exchange | pandas | Human-readable; large files are slow |
| .xlsx | Business reports with formatting | openpyxl | Supports multiple sheets |
| .json | Nested/hierarchical data, APIs | json / pandas | Flexible schema, verbose |
| .parquet | Big data, columnar storage | pyarrow | Compressed, fast reads, ideal for ML |
| .xml | Legacy systems, government data | lxml / xml.etree | Verbose, tree-based structure |
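For datasets over roughly 100 MB, consider switching from CSV to Parquet. Because Parquet is a columnar format, a reader can load only the columns it needs, which often makes reads many times faster than CSV (10–100× is a commonly quoted range, though the actual gain depends on the data and the query). Its built-in compression can also shrink files substantially, by up to about 80%. It is a standard format in big data pipelines (Spark, Databricks, AWS Glue).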
4. Web Scraping
Web scraping involves programmatically extracting data directly from websites when no API exists. You download the raw HTML of a webpage and parse it to extract the information you need.
Always check a website's robots.txt file before scraping. Scraping personal data or copyrighted content, or violating a site's Terms of Service, can lead to legal consequences. Prefer official APIs whenever they exist.
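The two steps described above, checking robots.txt and parsing HTML, can be sketched with the standard library alone. The robots.txt rules, the HTML snippet, the example.com base URL, and the "my-crawler" user agent are all hypothetical, and everything is supplied inline so the example runs without network access.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules; a real crawler would download the site's own file.
robots = RobotFileParser()
robots.parse([
    "User-agent: *",
    "Disallow: /private/",
])


class LinkExtractor(HTMLParser):
    """Collect the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


# Hypothetical page content standing in for downloaded HTML.
html = (
    '<html><body><a href="/products/1">One</a>'
    '<a href="/private/admin">Admin</a></body></html>'
)
parser = LinkExtractor()
parser.feed(html)

base = "https://example.com"
allowed = [
    link for link in parser.links if robots.can_fetch("my-crawler", base + link)
]
print(allowed)  # ['/products/1']
```

For messy real-world pages, a dedicated parsing library is usually more forgiving than html.parser, but the robots.txt check works the same way regardless of which parser is used.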
5. Kaggle & Open Dataset Platforms
For learning, prototyping, and competitions, curated dataset platforms provide ready-to-use, labelled data. Kaggle is the largest, but several other platforms serve specific domains.
| Platform | Domain Focus | Notable Feature |
|---|---|---|
| Kaggle | General ML, competitions | 1M+ datasets, free GPU notebooks |
| UCI ML Repository | Academic, classic ML datasets | Benchmark datasets (Iris, Wine, Adult, etc.) |
| Hugging Face | NLP, images, audio | One-line dataset loading via datasets library |
| Google Dataset Search | Multi-domain | Search engine for publicly available datasets |
| data.gov / data.gov.in | Government data | Official public sector datasets |
Structured vs Unstructured Data
Not all data looks the same. One of the most important distinctions in data science is whether data is structured (organised and machine-readable) or unstructured (raw, without a predefined format). A third category — semi-structured — sits in between.
Structured data
- Rows and columns with fixed schema
- Stored in relational databases
- Directly queryable with SQL
- Easy to analyse with pandas
- ~20% of all enterprise data
Semi-structured data
- Has some organisational properties
- No rigid fixed schema
- Tags or keys separate data fields
- Examples: JSON, XML, HTML, logs
- Parseable but requires transformation
Unstructured data
- No predefined model or schema
- Stored in data lakes (S3, HDFS)
- Requires NLP, CV, or deep learning
- Examples: images, audio, video, text
- ~80% of all data generated today
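Semi-structured data is described above as parseable but requiring transformation. One way that transformation might look for JSON is sketched below: nested keys are flattened into dotted column names so every record fits the same table. The sample records and the flatten helper are invented for illustration.

```python
import json

# Made-up API-style payload: the second record is missing user.city.
raw = '''
[
  {"id": 1, "user": {"name": "Priya", "city": "Pune"}, "tags": ["new"]},
  {"id": 2, "user": {"name": "Arjun"}, "tags": []}
]
'''


def flatten(record, parent_key="", sep="."):
    """Flatten nested dicts into dotted keys, e.g. user -> name becomes 'user.name'."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value  # lists and scalars are kept as-is
    return flat


rows = [flatten(record) for record in json.loads(raw)]
columns = sorted({key for row in rows for key in row})
# Missing fields become None so every row has the same columns.
table = [[row.get(col) for col in columns] for row in rows]

print(columns)  # ['id', 'tags', 'user.city', 'user.name']
print(table)    # [[1, ['new'], 'Pune', 'Priya'], [2, [], None, 'Arjun']]
```

pandas offers json_normalize for the same job; the hand-written version is shown only to make the transformation step explicit.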
Structured Data — Deep Dive
Structured data lives in a schema: every record has the same fields, the same data types, and the same constraints. This makes it easy to filter, aggregate, and join across tables.
| customer_id | name | age | purchase_amount | date |
|---|---|---|---|---|
| 1001 | Priya Sharma | 28 | ₹4,500 | 2024-03-12 |
| 1002 | Arjun Mehta | 35 | ₹12,200 | 2024-03-13 |
| 1003 | Deepa Nair | 42 | ₹8,750 | 2024-03-14 |
Each row is a record. Each column has a fixed type (integer, string, date). You can immediately run SELECT AVG(purchase_amount) FROM customers WHERE age > 30 with no preprocessing, provided purchase_amount is stored as a number rather than as formatted text such as ₹4,500.
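For comparison, here is a sketch of the same filter-and-average in pandas, using the three sample rows from the table above. It assumes purchase_amount is held as plain numbers (rupees) rather than formatted strings.

```python
import pandas as pd

customers = pd.DataFrame(
    {
        "customer_id": [1001, 1002, 1003],
        "name": ["Priya Sharma", "Arjun Mehta", "Deepa Nair"],
        "age": [28, 35, 42],
        "purchase_amount": [4500, 12200, 8750],  # rupees, stored as numbers
        "date": pd.to_datetime(["2024-03-12", "2024-03-13", "2024-03-14"]),
    }
)

# Equivalent of: SELECT AVG(purchase_amount) FROM customers WHERE age > 30
avg_over_30 = customers.loc[customers["age"] > 30, "purchase_amount"].mean()
print(avg_over_30)  # 10475.0
```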
Structured data is ideal for regression, classification, time-series forecasting, and business intelligence. If your problem involves numeric or categorical features in a table, structured data is your friend.
Unstructured Data — Deep Dive
Unstructured data has no predefined format. A photograph, a customer review, a recorded call, or a PDF report — these cannot be loaded directly into a spreadsheet. They require specialised techniques to extract meaning.
| Data Type | Example | Technique Used | Python Library |
|---|---|---|---|
| Text | Product reviews, news articles | NLP, tokenisation, embeddings | spaCy, NLTK, Transformers |
| Image | X-rays, product photos, CCTV | Convolutional Neural Networks (CNNs) | OpenCV, Torchvision, PIL |
| Audio | Calls, music, speech | Spectrograms, speech-to-text | librosa, SpeechRecognition |
| Video | Surveillance, social media | Frame extraction + CNNs | OpenCV, ffmpeg |
| PDF / Documents | Contracts, reports | OCR, text extraction | pdfplumber, PyMuPDF |
Unstructured data must be transformed into structured form before most ML algorithms can use it. For text: word counts, TF-IDF vectors, or sentence embeddings. For images: pixel arrays or CNN feature maps. This transformation step is called feature extraction.
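To make feature extraction concrete for text, the sketch below computes TF-IDF vectors by hand using only the standard library. It uses the plain textbook weighting (term frequency times log of inverse document frequency), whereas libraries typically apply smoothed variants, so the numbers will differ from theirs. The three review snippets are invented.

```python
import math
from collections import Counter

# Invented product-review snippets.
docs = [
    "great phone great battery",
    "battery died fast",
    "great camera",
]

tokenised = [doc.split() for doc in docs]
vocab = sorted({word for tokens in tokenised for word in tokens})

# Document frequency: how many documents contain each word.
doc_freq = Counter(word for tokens in tokenised for word in set(tokens))


def tfidf_vector(tokens):
    """Return one TF-IDF weight per vocabulary word for a tokenised document."""
    counts = Counter(tokens)
    return [
        (counts[word] / len(tokens)) * math.log(len(docs) / doc_freq[word])
        for word in vocab
    ]


matrix = [tfidf_vector(tokens) for tokens in tokenised]

print(vocab)
for row in matrix:
    print([round(weight, 3) for weight in row])
```

Each review is now a fixed-length numeric row, one column per vocabulary word, which is the structured form most ML algorithms expect. Words that appear in fewer documents receive a higher weight than equally frequent words that appear in many.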
Structured vs Unstructured — Side-by-Side
| Property | Structured | Semi-Structured | Unstructured |
|---|---|---|---|
| Schema | Fixed | Flexible | None |
| Storage | RDBMS (SQL) | NoSQL / JSON stores | Data lakes, object storage |
| Query with SQL | Yes | Partially | No |
| ML-ready out of the box | Yes | Needs parsing | No |
| Volume in the real world | ~20% | ~10% | ~70–80% |
| Example | Sales database | API JSON response | Customer reviews, images |
| Typical tools | pandas, SQL | json, lxml, MongoDB | spaCy, OpenCV, Transformers |
Golden Rules of Data Collection
Data collection is not a one-time event — it is an ongoing, deliberate process. The best data scientists treat data collection with the same rigour they apply to modelling: questioning assumptions, validating sources, and continuously monitoring for drift or degradation in data quality.