1 – Yun.Bun

Classical Datasets

The datasets below are iconic to academic research, industry, and model development, across machine learning domains. In a previous post, we introduced the Raspberry Pi, a versatile, affordable, and compact single-board computer suitable for a wide range of tasks. Now, let’s explore how it can be used to develop, train, and test machine learning models on classical datasets!

General Classification / Machine Learning
Natural Language Processing
Image / Vision
Time Series / Sensor Data
Finance / Economics
Healthcare / Biology

1 – GC / ML

∘ Iris Dataset
A classic dataset in machine learning, often used for classification tasks. It consists of 150 samples from 3 species of iris flowers, with 4 features (sepal length, sepal width, petal length, and petal width).
∘ Wine Dataset
This dataset is used for classification and consists of 13 chemical features measured from wine samples belonging to three different classes (cultivars of wine).
∘ Breast Cancer Wisconsin (Diagnostic) Dataset
A popular dataset used for binary classification tasks (malignant vs benign). It includes 30 features related to cell nucleus measurements from breast cancer biopsies.
∘ MNIST (Modified National Institute of Standards and Technology) Database
Contains 70,000 28×28 grayscale images of handwritten digits (0-9) and is widely used for training image classification models.

Categories: Classification, Image Recognition

<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8" />
  <meta name="viewport" content="width=device-width, initial-scale=1.0"/>
  <title>Digit Verifier</title>
  <style>
    body { 
      font-family: Arial, sans-serif; 
      text-align: center; 
      padding: 30px; 
      background-color: #f9f9f9;
    }
# Want access? Visit Yun.Bun I/O!

2 – NLP

∘ IMDB Movie Reviews
A widely used dataset for sentiment analysis, containing 50,000 movie reviews labeled as positive or negative.
∘ 20 Newsgroups
A dataset for text classification, containing 20 different newsgroups (topics) like sports, politics, and technology. It’s often used for multi-class text classification tasks.
∘ SQuAD (Stanford Question Answering Dataset)
A popular dataset used for machine reading comprehension, where models are asked to answer questions based on a passage of text.
∘ Penn Treebank (PTB)
A widely used corpus for syntactic and semantic tasks, including parsing and part-of-speech tagging.

Categories: Text Analysis, Speech & Language Processing

Example: Building a simple voice assistant on a Raspberry Pi using NLP techniques like intent classification.

No Raspberry Pi? – Regular Laptop or Desktop, Arduino or Cloud Platforms.

3 – CV

∘ CIFAR-10 and CIFAR-100
Both datasets consist of 60,000 32×32 color images, where CIFAR-10 contains 10 classes and CIFAR-100 contains 100 classes. These are often used for object classification tasks.
∘ ImageNet
A large-scale image dataset used for object recognition. It contains millions of labeled images in over 1,000 categories and is widely used in deep learning and computer vision.
∘ COCO (Common Objects in Context)
A large-scale image dataset for object detection, segmentation, and captioning. It contains over 300,000 images with multiple objects in each and provides labels for them.
∘ MNIST
Mentioned above, it is also a classic dataset for digit recognition in computer vision.

Categories: Computer Vision, Edge AI

An example: Using a Raspberry Pi with OpenCV and a camera to create a smart security system that detects movement and classifies objects (people, animals, etc.).

No Raspberry Pi? – Old Smartphone, Regular Laptop or Desktop or Jetson Nano.

4 – TS. and SD

∘ UCI Machine Learning Repository (various)
The UCI repository has a number of time series datasets, including those for sensor data, stock prices, and more. Notable ones include the “Air Quality” dataset (air pollution sensor data) and the “Electricity Load Diagrams” dataset.
∘ ECG (Electrocardiogram) Dataset
A dataset used for heart disease classification, including electrocardiogram signals, which are time series data.
∘ Yahoo Finance Stock Data
Stock price data available for many companies, useful for time series forecasting, trend analysis, and anomaly detection.
∘ Kaggle’s Bike Sharing Dataset
This dataset contains time series data of bike-sharing systems in Washington, D.C. It includes features like temperature, humidity, and number of bike rentals.

Categories: Regression/Forecasting, Anomalies

An example: Setting up a Raspberry Pi with environmental sensors to monitor temperature and humidity and then use that data for time series analysis and forecasting.

No Raspberry Pi? – Arduino with a USB Cable, Smartphone as a Sensor or Desktop/Laptop with Temperature/Humidity Sensors.

5 – Fin. & Econ.

∘ Stock Market Data (Yahoo Finance / Quandl)
Stock price history and market data are used for forecasting financial trends, predicting stock prices, and other economic analyses.
∘ Fama/French Data
A well-known dataset in financial research, providing returns of various stock portfolios and factors such as size, value, and momentum.
∘ Bitcoin Historical Data
A dataset containing the price of Bitcoin over time, often used for cryptocurrency price prediction and financial forecasting.
∘ World Bank Economic Indicators
Provides global economic data such as GDP, inflation rates, unemployment, and other economic indicators for multiple countries.

Categories: Financial Time Series, Economic Analysis

An example: Using the Raspberry Pi to track live stock prices and generate alerts when certain conditions are met (e.g., price drops or rises by a certain percentage).

No Raspberry Pi? – Python on Your PC or Laptop, Cloud-Based Services or Mobile Apps with Alert Features.

6 – H.C. and Biol.

∘ MIMIC-III (Medical Information Mart for Intensive Care)
A large, publicly available critical care database that includes patient data such as demographics, vital signs, medications, lab results, and diagnoses.
∘ Diabetes Dataset
A commonly used dataset for predicting whether a patient has diabetes based on features like age, BMI, blood pressure, and more.
∘ Breast Cancer Wisconsin (Diagnostic) Dataset
Also used for machine learning, as mentioned above, this dataset is often used in medical applications for diagnosing breast cancer.
∘ Gene Expression Omnibus (GEO)
A large-scale repository of gene expression data. The datasets in GEO are used for biological and medical research, such as cancer genomics and other diseases.

Categories: Clinical and Predictive health

An example: Creating a health monitoring system where a Raspberry Pi collects ECG data from a sensor and performs real-time anomaly detection to detect potential arrhythmias.

No Raspberry Pi? – Microcontroller with Wi-Fi/Bluetooth, Computer (Laptop/PC) with a Sensor Interface or Mobile Phone with ECG Sensor.

Tech finds added to shopping list! For more, visit Yun.Bun I/O.

2 responses to “1”

Michelle

May 3, 2025

What an interesting read this was! To be honest, I knew nothing about datasets prior to this article, so I definitely learned a thing or two.

LikeLike

Reply
Kimberley A.

May 4, 2025

Your curated list of classical datasets is incredibly helpful! I appreciate how you’ve organized them by domain—it’s a great resource for anyone diving into machine learning projects.

LikeLike

Reply

1

2 responses to “1”

2 responses to “1”

Leave a comment Cancel reply