deepbox/datasets

Synthetic Data Generators

Parametric generators that create datasets with controllable geometry, difficulty, noise, and dimensionality. Use these for unit testing, benchmarking, teaching, and exploring model behavior. Every generator accepts a randomState seed for full reproducibility.
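For example, two calls with the same randomState produce identical data, which keeps test fixtures stable across runs. A minimal sketch using makeBlobs (the parameter values here are arbitrary):

import { makeBlobs } from "deepbox/datasets";

// Same seed, same data: re-running this snippet yields the exact same [X, y].
const [Xa, ya] = makeBlobs({ nSamples: 50, randomState: 7 });
const [Xb, yb] = makeBlobs({ nSamples: 50, randomState: 7 });
// Xa/ya and Xb/yb hold identical values; omitting randomState gives a
// fresh random draw on every call.
console.log(Xa.shape, Xb.shape); // [50, 2] [50, 2]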
makeBlobs
makeBlobs(opts?: MakeBlobsOptions): [Tensor, Tensor]

Generate isotropic Gaussian blobs for clustering and classification. Each cluster is sampled from N(μₖ, σ²I) where μₖ is the cluster center and σ is clusterStd. Centers are randomly placed unless explicitly provided. The default produces well-separated clusters ideal for testing KMeans.

Parameters:
opts.nSamples: number - Total number of points equally divided among clusters (default: 100)
opts.nFeatures: number - Dimensionality of each point (default: 2)
opts.centers: number | number[][] - Number of clusters or explicit center coordinates (default: 3)
opts.clusterStd: number - Standard deviation of each cluster (default: 1.0). Lower values → tighter clusters.
opts.randomState: number - Seed for reproducible generation
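Passing explicit center coordinates instead of a cluster count pins the geometry exactly. A small sketch using only the options documented above (values are arbitrary):

import { makeBlobs } from "deepbox/datasets";

// Three fixed centers in 2-D; clusterStd 0.3 keeps the blobs tight.
const [X, y] = makeBlobs({
  nSamples: 150,
  centers: [[0, 0], [5, 5], [0, 5]],
  clusterStd: 0.3,
  randomState: 0,
});
console.log(X.shape); // [150, 2]
console.log(y.shape); // [150] (labels 0..2, one per cluster)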
makeCircles
makeCircles(opts?: MakeCirclesOptions): [Tensor, Tensor]

Generate two concentric circles — an inner ring (class 0) and an outer ring (class 1). Points are uniformly distributed around each circle with optional Gaussian noise. Linear classifiers cannot separate these classes; use kernel SVM, decision trees, or neural networks instead. The factor parameter controls how close the two circles are.

Parameters:
opts.nSamples: number - Total number of points, split equally between inner and outer circles (default: 100)
opts.noise: number - Standard deviation of Gaussian noise added to each point (default: 0)
opts.factor: number - Ratio of inner circle radius to outer circle radius, in (0, 1). 0.5 means the inner circle has half the radius. (default: 0.8)
opts.randomState: number - Seed for reproducibility
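As a quick illustration (settings are arbitrary), a smaller factor pushes the rings apart and widens the gap between the classes:

import { makeCircles } from "deepbox/datasets";

// factor: 0.3 puts the inner ring at 30% of the outer radius,
// leaving a wide gap between class 0 (inner) and class 1 (outer).
const [X, y] = makeCircles({ nSamples: 400, noise: 0.02, factor: 0.3, randomState: 1 });
console.log(X.shape); // [400, 2]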
makeMoons
makeMoons(opts?: MakeMoonsOptions): [Tensor, Tensor]

Generate two interleaving half-circle (crescent/moon) shapes. Class 0 is the upper moon, class 1 is the lower moon shifted right and down. With low noise, the two moons interlock but do not overlap. Another classic non-linear binary classification test — widely used to demonstrate decision boundaries.

Parameters:
opts.nSamples: number - Total points split equally between the two moons (default: 100)
opts.noise: number - Standard deviation of Gaussian noise (default: 0)
opts.randomState: number - Seed for reproducibility
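A brief sketch (arbitrary settings): with noise around 0.1 the moons stay interlocked but distinct, while larger values blur the boundary:

import { makeMoons } from "deepbox/datasets";

// Moderate noise keeps the two crescents visually distinct.
const [X, y] = makeMoons({ nSamples: 300, noise: 0.1, randomState: 2 });
console.log(X.shape); // [300, 2]
console.log(y.shape); // [300] (0 = upper moon, 1 = lower moon)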
makeClassification
makeClassification(opts?: MakeClassificationOptions): [Tensor, Tensor]

Generate a random n-class classification problem with fine-grained control over difficulty. Creates nInformative genuinely informative features and nRedundant linear combinations of them, then fills the remaining columns with noise. A fraction of labels (flipY) is randomly flipped to inject label noise. This is the most flexible generator for stress-testing classifiers.

Parameters:
opts.nSamples: number - Number of samples (default: 100)
opts.nFeatures: number - Total number of features (default: 20)
opts.nInformative: number - Number of informative features (default: 2)
opts.nRedundant: number - Number of redundant (linear combo) features (default: 2)
opts.nClasses: number - Number of classes (default: 2)
opts.flipY: number - Fraction of labels to randomly flip (default: 0.01)
opts.randomState: number - Seed for reproducibility
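For example, the following sketch (arbitrary settings) builds a 3-class problem where only 4 of 10 features carry signal and 2% of labels are flipped:

import { makeClassification } from "deepbox/datasets";

// 4 informative + 2 redundant features; the remaining 4 are pure noise.
const [X, y] = makeClassification({
  nSamples: 1000,
  nFeatures: 10,
  nInformative: 4,
  nRedundant: 2,
  nClasses: 3,
  flipY: 0.02,
  randomState: 3,
});
console.log(X.shape); // [1000, 10]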
makeRegression
makeRegression(opts?: MakeRegressionOptions): [Tensor, Tensor]

Generate a random linear regression problem: y = Xw + noise. Features are drawn from N(0, 1), and the true coefficient vector w is returned so you can verify your model recovers it. Only nInformative features have non-zero coefficients; the rest are noise columns.

Parameters:
opts.nSamples: number - Number of samples (default: 100)
opts.nFeatures: number - Total number of features (default: 10)
opts.nInformative: number - Features with non-zero coefficients (default: 10)
opts.noise: number - Standard deviation of Gaussian noise added to y (default: 0)
opts.randomState: number - Seed for reproducibility
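A minimal sketch of a sparse setting (values are arbitrary): only 3 of 8 features get non-zero coefficients, so a well-regularized model should shrink the rest toward zero:

import { makeRegression } from "deepbox/datasets";

// 3 informative columns out of 8; noise adds N(0, 0.5²) to y.
const [X, y] = makeRegression({
  nSamples: 200,
  nFeatures: 8,
  nInformative: 3,
  noise: 0.5,
  randomState: 4,
});
console.log(X.shape); // [200, 8]
console.log(y.shape); // [200]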
makeGaussianQuantiles
makeGaussianQuantiles(opts?: MakeGaussianQuantilesOptions): [Tensor, Tensor]

Generate an isotropic Gaussian cloud and partition samples into classes based on quantiles of the Mahalanobis distance from the origin (for an isotropic Gaussian this reduces to the Euclidean distance). This creates concentric, roughly spherical class boundaries. With 2 classes the result looks like makeCircles but in arbitrary dimensions; with more classes you get nested shells.

Parameters:
opts.nSamples: number - Number of samples (default: 100)
opts.nFeatures: number - Number of features (default: 2)
opts.nClasses: number - Number of quantile-based classes (default: 3)
opts.randomState: number - Seed for reproducibility
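Since this generator does not appear in the combined example below, here is a brief sketch (arbitrary settings): three nested shells in 2-D, each holding roughly a third of the samples:

import { makeGaussianQuantiles } from "deepbox/datasets";

// Classes are assigned by distance-from-origin quantiles, so each
// class gets ~nSamples / nClasses points arranged as concentric shells.
const [X, y] = makeGaussianQuantiles({
  nSamples: 300,
  nFeatures: 2,
  nClasses: 3,
  randomState: 5,
});
console.log(X.shape); // [300, 2]
console.log(y.shape); // [300] (labels 0..2)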

makeBlobs

xᵢ ~ N(μₖ, σ²I) for cluster k

Where:

  • μₖ = Center of cluster k
  • σ = clusterStd parameter

makeRegression

y = Xw + ε, ε ~ N(0, noise²)

Where:

  • w = True coefficient vector (returned as coef)
  • ε = Gaussian noise

makeGaussianQuantiles

y = quantile_bin(‖x‖₂, nClasses)

Where:

  • ‖x‖₂ = Euclidean distance from origin
synthetic-datasets.ts
import { makeBlobs, makeCircles, makeMoons, makeRegression, makeClassification } from "deepbox/datasets";

// ── Gaussian blobs for clustering ──
const [X, y] = makeBlobs({
  nSamples: 300,
  centers: 3,
  clusterStd: 0.5,
  randomState: 42,
});
console.log(X.shape);       // [300, 2]
console.log(y.shape);       // [300] — cluster labels

// ── Concentric circles (non-linear binary) ──
const [circlesX, circlesY] = makeCircles({ nSamples: 200, noise: 0.05, factor: 0.5 });

// ── Interleaving moons (non-linear binary) ──
const [moonsX, moonsY] = makeMoons({ nSamples: 200, noise: 0.1 });

// ── Regression ──
const [regX, regY] = makeRegression({ nSamples: 100, nFeatures: 5, noise: 0.1, randomState: 42 });
console.log(regX.shape); // [100, 5]

// ── Complex classification with noise ──
const [clsX, clsY] = makeClassification({
  nSamples: 500,
  nFeatures: 20,
  nInformative: 5,
  nRedundant: 3,
  nClasses: 4,
  flipY: 0.05,
  randomState: 42,
});

Choosing a Generator

  • makeBlobs — Clustering (KMeans, DBSCAN), Gaussian mixture testing, simple multiclass classification
  • makeCircles — Non-linear binary classification benchmarks (kernel SVM, neural nets vs linear models)
  • makeMoons — Non-linear binary classification with interleaving structure (decision boundary visualization)
  • makeClassification — Stress-testing classifiers with controlled difficulty, feature redundancy, and label noise
  • makeRegression — Verifying regression models recover known coefficients; testing regularization behavior
  • makeGaussianQuantiles — Non-linear multiclass in arbitrary dimensions with concentric spherical boundaries