deepbox/datasets

Built-in Datasets

24 deterministic, synthetic datasets bundled directly inside the package. No downloads, no network access, no external files. Every loader returns a Dataset object with typed Tensor data, target labels, feature names, and a description. All data is generated from seeded RNGs so calls are always reproducible.

Dataset

type Dataset = {
  data: Tensor;              // Feature matrix [nSamples, nFeatures]
  target: Tensor;            // Target vector [nSamples] or matrix [nSamples, nTargets]
  featureNames: string[];    // Human-readable column labels
  targetNames?: string[];    // Class labels (classification) or target labels (regression)
  description: string;       // One-line summary of the dataset
};
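Because every loader returns the same kind of object, generic helpers can be written once and reused across datasets. The sketch below is one possible illustration, not part of deepbox: `TensorLike`, `DatasetLike`, and `summarize` are hypothetical names, and the only thing assumed about a tensor is the `shape` property.

```typescript
// Hypothetical structural stand-ins: only `shape` is assumed on a tensor.
type TensorLike = { shape: number[] };

type DatasetLike = {
  data: TensorLike;
  target: TensorLike;
  featureNames: string[];
  targetNames?: string[];
  description: string;
};

// Build a short summary and check that the metadata agrees with the shapes.
function summarize(ds: DatasetLike): string {
  const nSamples = ds.data.shape[0] ?? 0;
  const nFeatures = ds.data.shape[1] ?? 0;
  if (ds.featureNames.length !== nFeatures) {
    throw new Error(
      `expected ${nFeatures} feature names, got ${ds.featureNames.length}`
    );
  }
  if (ds.target.shape[0] !== nSamples) {
    throw new Error("data and target disagree on the number of samples");
  }
  // A 1-D target has one column; a 2-D target has shape[1] columns.
  const nTargets = ds.target.shape[1] ?? 1;
  return `${nSamples} samples, ${nFeatures} features, ${nTargets} target column(s)`;
}

// Mock object using the Iris shapes listed on this page (no real tensor data).
const irisLike: DatasetLike = {
  data: { shape: [150, 4] },
  target: { shape: [150] },
  featureNames: ["sepal length", "sepal width", "petal length", "petal width"],
  targetNames: ["setosa", "versicolor", "virginica"],
  description: "mock object for illustration",
};

console.log(summarize(irisLike)); // 150 samples, 4 features, 1 target column(s)
```

A real Dataset returned by a loader should satisfy `DatasetLike` structurally as long as its tensors expose `shape` as an array of numbers.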

Classic Benchmark Datasets

loadIris() — 150 samples · 4 features (sepal/petal length & width) · 3 classes (setosa, versicolor, virginica) · The most widely used dataset in ML education. Tests multiclass classification with overlapping class boundaries between versicolor and virginica.
loadDigits() — 1797 samples · 64 features (8×8 pixel intensities 0–15) · 10 classes (digits 0–9) · A lightweight alternative to MNIST for testing image classifiers, dimensionality reduction, and clustering without large download sizes.
loadBreastCancer() — 569 samples (212 malignant, 357 benign) · 30 features (mean, error, and worst of radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry, fractal dimension) · Binary classification benchmark for medical diagnostics.
loadDiabetes() — 442 samples · 10 features (age, sex, BMI, blood pressure, s1–s6 blood serum measurements) · Continuous target (disease progression after 1 year) · Standard regression benchmark.
loadLinnerud() — 20 samples · 3 exercise features (chins, situps, jumps) → 3 physiological targets (weight, waist, pulse) · Multi-output regression with very few samples, useful for testing multivariate models.
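When a benchmark is imbalanced, raw accuracy is easier to interpret next to a majority-class baseline. The following sketch works through that arithmetic for the class counts quoted for loadBreastCancer() (212 malignant, 357 benign). `majorityBaseline` is a hypothetical helper, not a deepbox function.

```typescript
// Accuracy obtained by always predicting the most frequent class.
function majorityBaseline(classCounts: number[]): number {
  const total = classCounts.reduce((sum, count) => sum + count, 0);
  if (total === 0) throw new Error("no samples");
  return Math.max(...classCounts) / total;
}

// Class counts quoted above: 212 malignant, 357 benign (569 samples).
const breastCancerBaseline = majorityBaseline([212, 357]);
console.log(breastCancerBaseline.toFixed(3)); // 0.627, i.e. 357 / 569
```

A classifier on this dataset only demonstrates learning if it clearly exceeds roughly 62.7% accuracy.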

Domain-Specific Datasets

loadFlowersExtended() — 180 samples · 6 features (sepal/petal length & width, color intensity, stem thickness) · 4 species (setosa, versicolor, virginica, chrysantha) · Extended Iris with additional morphological features and a 4th class.
loadLeafShapes() — 150 samples · 8 geometric features (area, perimeter, aspect ratio, curvature, compactness, lobedness, elongation, solidity) · 5 species (maple, oak, birch, willow, ginkgo) · Multi-class classification from leaf geometry.
loadFruitQuality() — 150 samples · 5 features (weight, sugar content in Brix, acidity pH, firmness in Newtons, color score) · 3 classes (apple, orange, banana) · Food quality classification from measurable physical properties.
loadSeedMorphology() — 150 samples · 4 features (length, width, roundness, density) · 3 seed types (wheat, rice, sunflower) · Classification from seed morphometric measurements.
loadSensorStates() — 180 samples · 6 sensor readings (temperature, pressure, humidity, voltage, vibration, current) · 3 operating modes (normal, heating, fault) · Industrial IoT anomaly detection dataset.
loadStudentPerformance() — 150 samples · 3 integer features (study hours/week, absences, quiz score 0–100) · 3 outcome classes (fail, pass, excellent) · Educational outcome prediction.
loadTrafficConditions() — 150 samples · 3 features (time of day, speed km/h, density vehicles/km) · 3 classes (light, moderate, heavy) · Traffic flow classification.
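The class names listed for these datasets can be used to turn integer predictions back into readable labels. The sketch below assumes that labels are 0-based indices into `targetNames` and that the labels have already been copied into a plain `number[]`; neither detail is stated on this page, and `decodeLabels` is a hypothetical helper.

```typescript
// Map integer class labels to names.
// Assumption: labels are 0-based indices into targetNames.
function decodeLabels(labels: number[], targetNames: string[]): string[] {
  return labels.map((label) => {
    const name = targetNames[label];
    if (name === undefined) {
      throw new Error(`label ${label} has no matching class name`);
    }
    return name;
  });
}

// Operating modes listed above for loadSensorStates().
const sensorModes = ["normal", "heating", "fault"];
console.log(decodeLabels([0, 2, 1, 0], sensorModes));
// ["normal", "fault", "heating", "normal"]
```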

Regression Datasets

loadHousingMini() — 200 samples · 4 features (size sqm, rooms, age years, distance to center km) · Target: price in thousands · Regression with interpretable linear relationships.
loadPlantGrowth() — 200 samples · 3 features (sunlight hours/day, water mL/day, soil quality 0–10) · Target: height in cm after 30 days · Agricultural regression with clear causal structure.
loadEnergyEfficiency() — 200 samples · 3 features (insulation R-value, window area sqm, orientation degrees) · Target: energy usage in kWh · Building energy regression with trigonometric orientation effect.
loadCropYield() — 200 samples · 3 features (rainfall mm, fertilizer kg/ha, temperature °C) · Target: yield in tons/ha · Non-linear temperature effect (quadratic peak at 25°C).
loadCustomerSegments() — 200 samples · 3 features (age, income thousands, spending score 0–100) · 4 natural clusters (young_budget, young_premium, mature_moderate, mature_saver) · Ideal for KMeans and DBSCAN clustering benchmarks.
loadFitnessScores() — 100 samples · 3 features (exercise duration min, intensity 1–10, frequency times/week) → 3 targets (strength, endurance, flexibility) · Multi-output regression.
loadWeatherOutcomes() — 150 samples · 3 features (humidity %, pressure hPa, temperature °C) → 2 targets (rain probability, wind speed km/h) · Multi-output regression with non-linear dynamics.
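Two of the regression datasets above are described as having non-linear effects: a quadratic temperature peak at 25°C (loadCropYield) and a trigonometric orientation effect (loadEnergyEfficiency). The sketch below shows what such terms can look like. The function names, the curvature and amplitude constants, and the choice of cosine are illustrative assumptions, not the formulas deepbox uses to generate the data.

```typescript
// Hypothetical quadratic response: maximal at `peak`, falling off symmetrically.
function quadraticPeak(tempC: number, peak = 25, curvature = 0.01): number {
  return 1 - curvature * (tempC - peak) ** 2;
}

// Hypothetical trigonometric response to orientation in degrees (period 360).
function orientationEffect(degrees: number, amplitude = 1): number {
  return amplitude * Math.cos((degrees * Math.PI) / 180);
}

console.log(quadraticPeak(25)); // 1 (the peak)
console.log(quadraticPeak(20) === quadraticPeak(30)); // true (symmetric around 25)
console.log(orientationEffect(0)); // 1
```

Effects of this kind are why a purely linear model can underfit these targets unless features such as a squared temperature term or the sine and cosine of the orientation are added.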

Geometric / Non-Linear Datasets (2D–3D)

loadMoonsMulti() — 150 samples · 2D · 3 interleaving rotated moon-shaped classes · Tests non-linear classifiers; linear models will fail.
loadConcentricRings() — 150 samples · 2D · 3 concentric circle classes with radii 1.0, 2.5, 4.0 · Requires radial basis or kernel-based classifiers.
loadSpiralArms() — 150 samples · 2D · 3 interleaving spiral arms · Extremely non-linear; tests deep networks and kernel SVMs.
loadGaussianIslands() — 200 samples · 3D · 4 well-separated Gaussian clusters centered at (±3, ±3, ±3) · Clean clustering benchmark in higher dimensions.
loadPerfectlySeparable() — 100 samples · 4 features · 2 linearly separable classes with zero overlap · Sanity check for any classifier; all models should achieve 100% accuracy.
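To illustrate why a dataset such as loadConcentricRings() calls for a radial feature, the sketch below generates three rings at radii 1.0, 2.5, and 4.0 from a seeded RNG and classifies each point by its distance from the origin. This is a stand-in under stated assumptions: the mulberry32 generator, the uniform jitter, and the names `makeRings` and `nearestRing` are illustrative and do not describe how deepbox builds its data.

```typescript
// Small seeded PRNG (mulberry32); returns values in [0, 1).
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

type Point2D = { x: number; y: number; label: number };

// Generate `perRing` points on each ring, with uniform radial jitter.
function makeRings(
  radii: number[],
  perRing: number,
  jitter: number,
  seed: number
): Point2D[] {
  const rand = mulberry32(seed);
  const points: Point2D[] = [];
  radii.forEach((radius, label) => {
    for (let i = 0; i < perRing; i++) {
      const angle = 2 * Math.PI * rand();
      const r = radius + jitter * (rand() - 0.5); // within jitter/2 of the ring
      points.push({ x: r * Math.cos(angle), y: r * Math.sin(angle), label });
    }
  });
  return points;
}

// Classify a point by the ring whose radius is closest to its distance from the origin.
function nearestRing(p: Point2D, radii: number[]): number {
  const r = Math.hypot(p.x, p.y);
  let best = 0;
  let bestDist = Infinity;
  radii.forEach((radius, index) => {
    const d = Math.abs(r - radius);
    if (d < bestDist) {
      bestDist = d;
      best = index;
    }
  });
  return best;
}

const ringRadii = [1.0, 2.5, 4.0];
const rings = makeRings(ringRadii, 50, 0.4, 42);
const correct = rings.filter((p) => nearestRing(p, ringRadii) === p.label).length;
console.log(`${correct} / ${rings.length} classified correctly`);
```

In raw x and y coordinates no single straight line separates the rings, but the derived feature sqrt(x² + y²) makes them separable by simple thresholds, which is the idea behind radial basis and kernel methods.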

builtin-datasets.ts

import { loadIris, loadDigits, loadBreastCancer, loadHousingMini, loadSpiralArms } from "deepbox/datasets";
import { trainTestSplit } from "deepbox/preprocess";

// ── Classic benchmark ──
const iris = loadIris();
console.log(iris.data.shape);     // [150, 4]
console.log(iris.target.shape);   // [150]
console.log(iris.featureNames);   // ['sepal length (cm)', 'sepal width (cm)', ...]
console.log(iris.targetNames);    // ['setosa', 'versicolor', 'virginica']
console.log(iris.description);    // 'Synthetic ... 150 samples, 4 features, 3 classes.'

// ── Split for training ──
const [XTrain, XTest, yTrain, yTest] = trainTestSplit(
  iris.data, iris.target, { testSize: 0.2, randomState: 42 }
);
console.log(XTrain.shape); // [120, 4]
console.log(XTest.shape);  // [30, 4]

// ── Digits: small image classification ──
const digits = loadDigits();
console.log(digits.data.shape);   // [1797, 64]
console.log(digits.target.shape); // [1797]

// ── Regression: housing prices ──
const housing = loadHousingMini();
console.log(housing.featureNames); // ['size (sqm)', 'rooms', 'age (years)', 'distance to center (km)']

// ── Non-linear: spiral arms ──
const spirals = loadSpiralArms();
console.log(spirals.data.shape);  // [150, 2]
console.log(spirals.targetNames); // ['arm_0', 'arm_1', 'arm_2']
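The split shapes in the example follow from simple arithmetic: a `testSize` of 0.2 on 150 samples gives 30 test rows and 120 training rows. The sketch below reproduces that calculation. `splitSizes` is a hypothetical helper, and rounding to the nearest integer is an assumption; the rule trainTestSplit applies to fractional counts is not documented on this page and may differ.

```typescript
// Compute train/test row counts from a test fraction.
// Assumption: the test count is rounded to the nearest integer.
function splitSizes(
  nSamples: number,
  testSize: number
): { nTrain: number; nTest: number } {
  if (!(testSize > 0 && testSize < 1)) {
    throw new RangeError("testSize must be a fraction strictly between 0 and 1");
  }
  const nTest = Math.round(nSamples * testSize);
  return { nTrain: nSamples - nTest, nTest };
}

console.log(splitSizes(150, 0.2)); // { nTrain: 120, nTest: 30 }
```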

Key Points

All datasets are deterministic — seeded RNGs produce identical data on every call
No network access, no file I/O — data is generated in-memory on first call and cached
Every loader returns the same Dataset type with data, target, featureNames, and description, plus optional targetNames
Classification targets use dtype 'int32'; regression targets use 'float32'
Use trainTestSplit() from deepbox/preprocess to split any dataset for training and evaluation
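The "generated on first call and cached" behavior noted above can be pictured as a memoized generator. The sketch below is a generic illustration under stated assumptions: `memoize`, `loadToy`, and the small linear congruential generator are hypothetical and say nothing about how deepbox implements its loaders.

```typescript
// Wrap a generator so it runs once; later calls return the cached value.
function memoize<T>(generate: () => T): () => T {
  let cached: T | undefined;
  let ready = false;
  return () => {
    if (!ready) {
      cached = generate();
      ready = true;
    }
    return cached as T;
  };
}

let generatorRuns = 0;

// Toy loader: five pseudo-random values from a fixed-seed LCG.
const loadToy = memoize(() => {
  generatorRuns += 1;
  let state = 12345; // fixed seed, so the output never changes
  const values: number[] = [];
  for (let i = 0; i < 5; i++) {
    state = (Math.imul(state, 1664525) + 1013904223) >>> 0;
    values.push(state / 4294967296);
  }
  return values;
});

const first = loadToy();
const second = loadToy();
console.log(first === second); // true (same cached array)
console.log(generatorRuns);    // 1 (the generator ran only once)
```

One consequence of this pattern is that every caller shares the same object, so in this sketch a caller that mutates the returned array changes what later callers see; copying before modifying avoids that.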
