deepbox/datasets
Built-in Datasets
24 deterministic, synthetic datasets bundled directly inside the package. No downloads, no network access, no external files. Every loader returns a Dataset object with typed Tensor data, target labels, feature names, and a description. All data is generated from seeded RNGs so calls are always reproducible.
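To illustrate why seeded generation gives reproducible data, here is a minimal sketch using mulberry32, a small public-domain PRNG. This is an assumption for illustration only: the generator deepbox uses internally is not stated here and may differ, and `sample` is a hypothetical helper.

```typescript
// Hypothetical sketch: mulberry32 is a small 32-bit PRNG, used here only
// to illustrate seeding. deepbox's internal generator may be different.
function mulberry32(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (state + 0x6d2b79f5) >>> 0;
    let t = state;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296; // uniform in [0, 1)
  };
}

// Draw `n` values from a fresh generator seeded with `seed`.
function sample(seed: number, n: number): number[] {
  const rand = mulberry32(seed);
  return Array.from({ length: n }, () => rand());
}

const first = sample(42, 5);
const second = sample(42, 5);

// Same seed, same sequence: every value matches position by position.
const identical = first.every((v, i) => v === second[i]);
console.log(identical);
```

Because the sequence depends only on the seed, a loader built this way returns the same numbers on every call and on every machine.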
Dataset
type Dataset = {
  data: Tensor;           // Feature matrix [nSamples, nFeatures]
  target: Tensor;         // Target vector [nSamples] or matrix [nSamples, nTargets]
  featureNames: string[]; // Human-readable column labels
  targetNames?: string[]; // Class labels (classification) or target labels (regression)
  description: string;    // One-line summary of the dataset
};
Classic Benchmark Datasets
- loadIris() — 150 samples · 4 features (sepal/petal length & width) · 3 classes (setosa, versicolor, virginica) · The most widely used dataset in ML education. Tests multiclass classification with overlapping class boundaries between versicolor and virginica.
- loadDigits() — 1797 samples · 64 features (8×8 pixel intensities 0–15) · 10 classes (digits 0–9) · A lightweight alternative to MNIST for testing image classifiers, dimensionality reduction, and clustering without large download sizes.
- loadBreastCancer() — 569 samples (212 malignant, 357 benign) · 30 features (mean, error, and worst of radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry, fractal dimension) · Binary classification benchmark for medical diagnostics.
- loadDiabetes() — 442 samples · 10 features (age, sex, BMI, blood pressure, s1–s6 blood serum measurements) · Continuous target (disease progression after 1 year) · Standard regression benchmark.
- loadLinnerud() — 20 samples · 3 exercise features (chins, situps, jumps) → 3 physiological targets (weight, waist, pulse) · Multi-output regression with very few samples, useful for testing multivariate models.
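The class counts listed for loadBreastCancer() (212 malignant, 357 benign) set a floor that any useful classifier has to beat. The sketch below computes that majority-class baseline from the counts alone; `majorityBaseline` is a hypothetical helper, not part of deepbox.

```typescript
// Class counts as stated in the loadBreastCancer() entry.
const classCounts: Record<string, number> = { malignant: 212, benign: 357 };

// Accuracy of a classifier that always predicts the most frequent class.
function majorityBaseline(counts: Record<string, number>): { label: string; accuracy: number } {
  const entries = Object.entries(counts);
  const total = entries.reduce((sum, [, n]) => sum + n, 0);
  const [label, count] = entries.reduce((best, cur) => (cur[1] > best[1] ? cur : best));
  return { label, accuracy: count / total };
}

const baseline = majorityBaseline(classCounts);
console.log(baseline.label);               // benign
console.log(baseline.accuracy.toFixed(3)); // 0.627
```

A model scoring near 63% accuracy on this dataset has learned nothing beyond the class imbalance.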
Domain-Specific Datasets
- loadFlowersExtended() — 180 samples · 6 features (sepal/petal length & width, color intensity, stem thickness) · 4 species (setosa, versicolor, virginica, chrysantha) · Extended Iris with additional morphological features and a 4th class.
- loadLeafShapes() — 150 samples · 8 geometric features (area, perimeter, aspect ratio, curvature, compactness, lobedness, elongation, solidity) · 5 species (maple, oak, birch, willow, ginkgo) · Multi-class classification from leaf geometry.
- loadFruitQuality() — 150 samples · 5 features (weight, sugar content in Brix, acidity pH, firmness in Newtons, color score) · 3 classes (apple, orange, banana) · Food quality classification from measurable physical properties.
- loadSeedMorphology() — 150 samples · 4 features (length, width, roundness, density) · 3 seed types (wheat, rice, sunflower) · Classification from seed morphometric measurements.
- loadSensorStates() — 180 samples · 6 sensor readings (temperature, pressure, humidity, voltage, vibration, current) · 3 operating modes (normal, heating, fault) · Industrial IoT anomaly detection dataset.
- loadStudentPerformance() — 150 samples · 3 integer features (study hours/week, absences, quiz score 0–100) · 3 outcome classes (fail, pass, excellent) · Educational outcome prediction.
- loadTrafficConditions() — 150 samples · 3 features (time of day, speed km/h, density vehicles/km) · 3 classes (light, moderate, heavy) · Traffic flow classification.
Regression and Clustering Datasets
- loadHousingMini() — 200 samples · 4 features (size sqm, rooms, age years, distance to center km) · Target: price in thousands · Regression with interpretable linear relationships.
- loadPlantGrowth() — 200 samples · 3 features (sunlight hours/day, water mL/day, soil quality 0–10) · Target: height in cm after 30 days · Agricultural regression with clear causal structure.
- loadEnergyEfficiency() — 200 samples · 3 features (insulation R-value, window area sqm, orientation degrees) · Target: energy usage in kWh · Building energy regression with trigonometric orientation effect.
- loadCropYield() — 200 samples · 3 features (rainfall mm, fertilizer kg/ha, temperature °C) · Target: yield in tons/ha · Non-linear temperature effect (quadratic peak at 25°C).
- loadCustomerSegments() — 200 samples · 3 features (age, income thousands, spending score 0–100) · 4 natural clusters (young_budget, young_premium, mature_moderate, mature_saver) · Ideal for KMeans and DBSCAN clustering benchmarks.
- loadFitnessScores() — 100 samples · 3 features (exercise duration min, intensity 1–10, frequency times/week) → 3 targets (strength, endurance, flexibility) · Multi-output regression.
- loadWeatherOutcomes() — 150 samples · 3 features (humidity %, pressure hPa, temperature °C) → 2 targets (rain probability, wind speed km/h) · Multi-output regression with non-linear dynamics.
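The quadratic temperature effect noted for loadCropYield() can be pictured as an inverted parabola centered on the optimum. The sketch below is illustrative only: the 25 °C optimum comes from the entry above, while the 15 °C tolerance and the `temperatureFactor` function are assumptions, not the formula the loader uses.

```typescript
// The optimum comes from the loadCropYield() entry; the tolerance width
// is an illustrative assumption, not a documented value.
const OPTIMUM_C = 25;
const TOLERANCE_C = 15;

// Inverted parabola: 1 at the optimum, falling to 0 at optimum ± tolerance.
function temperatureFactor(tempC: number): number {
  const z = (tempC - OPTIMUM_C) / TOLERANCE_C;
  return Math.max(0, 1 - z * z);
}

for (const t of [10, 20, 25, 30, 40]) {
  console.log(t, temperatureFactor(t).toFixed(3));
}
```

A model that is linear in temperature cannot represent this shape, since the response rises and then falls. Adding a squared temperature feature is one common way to let a linear regressor fit it.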
Geometric / Non-Linear Datasets (2D–3D)
- loadMoonsMulti() — 150 samples · 2D · 3 interleaving rotated moon-shaped classes · Tests non-linear classifiers; linear models will fail.
- loadConcentricRings() — 150 samples · 2D · 3 concentric circle classes with radii 1.0, 2.5, 4.0 · Requires radial basis or kernel-based classifiers.
- loadSpiralArms() — 150 samples · 2D · 3 interleaving spiral arms · Extremely non-linear; tests deep networks and kernel SVMs.
- loadGaussianIslands() — 200 samples · 3D · 4 well-separated Gaussian clusters centered at (±3, ±3, ±3) · Clean clustering benchmark in higher dimensions.
- loadPerfectlySeparable() — 100 samples · 4 features · 2 linearly separable classes with zero overlap · Sanity check for any classifier; all models should achieve 100% accuracy.
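To see why loadConcentricRings() calls for radial or kernel-based methods, the sketch below builds noise-free rings at the listed radii (1.0, 2.5, 4.0) and classifies them by distance from the origin. The point layout and the midpoint thresholds are simplifying assumptions; the actual dataset is generated by the library and will include noise.

```typescript
type Point = { x: number; y: number; label: number };

// Radii from the loadConcentricRings() entry. Points are evenly spaced and
// noise-free, which simplifies the real dataset.
const RADII = [1.0, 2.5, 4.0];
const POINTS_PER_RING = 50;

const points: Point[] = [];
RADII.forEach((r, label) => {
  for (let i = 0; i < POINTS_PER_RING; i++) {
    const angle = (2 * Math.PI * i) / POINTS_PER_RING;
    points.push({ x: r * Math.cos(angle), y: r * Math.sin(angle), label });
  }
});

// Classify by distance from the origin; thresholds are midpoints between radii.
function classifyByRadius(p: { x: number; y: number }): number {
  const r = Math.hypot(p.x, p.y);
  if (r < 1.75) return 0;
  if (r < 3.25) return 1;
  return 2;
}

const correct = points.filter((p) => classifyByRadius(p) === p.label).length;
console.log(`${correct}/${points.length} correct`); // 150/150 correct
```

In raw (x, y) coordinates all three rings share the same center, so no straight line separates them. A single derived feature, the radius, makes the classes trivially separable, which is the transformation an RBF kernel performs implicitly.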
builtin-datasets.ts
import { loadIris, loadDigits, loadBreastCancer, loadHousingMini, loadSpiralArms } from "deepbox/datasets";
import { trainTestSplit } from "deepbox/preprocess";

// ── Classic benchmark ──
const iris = loadIris();
console.log(iris.data.shape);     // [150, 4]
console.log(iris.target.shape);   // [150]
console.log(iris.featureNames);   // ['sepal length (cm)', 'sepal width (cm)', ...]
console.log(iris.targetNames);    // ['setosa', 'versicolor', 'virginica']
console.log(iris.description);    // 'Synthetic ... 150 samples, 4 features, 3 classes.'

// ── Split for training ──
const [XTrain, XTest, yTrain, yTest] = trainTestSplit(
  iris.data,
  iris.target,
  { testSize: 0.2, randomState: 42 }
);
console.log(XTrain.shape);        // [120, 4]
console.log(XTest.shape);         // [30, 4]

// ── Digits: small image classification ──
const digits = loadDigits();
console.log(digits.data.shape);   // [1797, 64]
console.log(digits.target.shape); // [1797]

// ── Regression: housing prices ──
const housing = loadHousingMini();
console.log(housing.featureNames); // ['size (sqm)', 'rooms', 'age (years)', 'distance to center (km)']

// ── Non-linear: spiral arms ──
const spirals = loadSpiralArms();
console.log(spirals.data.shape);  // [150, 2]
console.log(spirals.targetNames); // ['arm_0', 'arm_1', 'arm_2']
Key Points
- All datasets are deterministic — seeded RNGs produce identical data on every call
- No network access, no file I/O — data is generated in-memory on first call and cached
- Every loader returns the same Dataset type with data, target, featureNames, and description
- Classification targets use dtype 'int32'; regression targets use 'float32'
- Use trainTestSplit() from deepbox/preprocess to split any dataset for training and evaluation
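The invariants listed above (matching sample counts, one feature name per column, int32 or float32 targets) can be checked mechanically. The sketch below uses minimal structural stand-ins rather than the real deepbox types: `TensorLike`, `DatasetLike`, and `checkDataset` are hypothetical names, and exposing `dtype` as a tensor property is an assumption.

```typescript
// Minimal stand-ins for illustration; the real Tensor in deepbox has a richer API.
// `shape` appears in the usage example; `dtype` as a property is an assumption.
interface TensorLike { shape: number[]; dtype: string }
interface DatasetLike {
  data: TensorLike;
  target: TensorLike;
  featureNames: string[];
  targetNames?: string[];
  description: string;
}

// Return the list of violated invariants (empty when the dataset is consistent).
function checkDataset(ds: DatasetLike): string[] {
  const problems: string[] = [];
  const [nSamples, nFeatures] = ds.data.shape;
  if (ds.data.shape.length !== 2) problems.push("data must be 2-D [nSamples, nFeatures]");
  if (ds.target.shape[0] !== nSamples) problems.push("target and data disagree on nSamples");
  if (ds.featureNames.length !== nFeatures) problems.push("featureNames length must equal nFeatures");
  if (ds.target.dtype !== "int32" && ds.target.dtype !== "float32") {
    problems.push("target dtype must be 'int32' or 'float32'");
  }
  return problems;
}

// Hand-built stand-in with the Iris shapes listed above (no real data attached).
const irisLike: DatasetLike = {
  data: { shape: [150, 4], dtype: "float32" },
  target: { shape: [150], dtype: "int32" },
  featureNames: ["sepal length (cm)", "sepal width (cm)", "petal length (cm)", "petal width (cm)"],
  targetNames: ["setosa", "versicolor", "virginica"],
  description: "stand-in for illustration",
};

console.log(checkDataset(irisLike)); // []
```

A check like this is useful in tests that iterate over every loader, since all of them share the same return type.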