deepbox/preprocess

Data Splitting

Split datasets into training, testing, and validation sets. Supports stratified splitting and k-fold cross-validation.

trainTestSplit

trainTestSplit(X: Tensor, y: Tensor, opts?: SplitOptions): [Tensor, Tensor, Tensor, Tensor]

Splits X and y into random train and test subsets, keeping the rows of X and y aligned. Returns [XTrain, XTest, yTrain, yTest].

Parameters:

opts.testSize: number - Fraction of samples assigned to the test set, between 0 and 1 (default: 0.25)

opts.randomState: number - Seed for reproducibility

opts.stratify: Tensor - Class labels to stratify on. If provided, the train and test sets each preserve the class proportions of these labels

opts.shuffle: boolean - Whether to shuffle before splitting (default: true)
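To make the interaction of testSize, shuffle, and randomState concrete, here is a minimal sketch in plain TypeScript that partitions row indices instead of Tensors. It is not deepbox's implementation: the helper names (mulberry32, splitIndices), the choice of PRNG, and rounding the test count up are all assumptions made for illustration, and stratification is left out.

```typescript
// Hypothetical helpers for illustration; not part of deepbox.

// Small seeded PRNG (mulberry32) so the same seed yields the same shuffle.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

interface IndexSplit {
  trainIndex: number[];
  testIndex: number[];
}

function splitIndices(
  nSamples: number,
  testSize = 0.25,
  shuffle = true,
  seed = 0,
): IndexSplit {
  if (!(testSize > 0 && testSize < 1)) {
    throw new RangeError("testSize must be a fraction strictly between 0 and 1");
  }
  const indices = Array.from({ length: nSamples }, (_, i) => i);
  if (shuffle) {
    // Fisher-Yates shuffle driven by the seeded generator.
    const rand = mulberry32(seed);
    for (let i = nSamples - 1; i > 0; i--) {
      const j = Math.floor(rand() * (i + 1));
      const tmp = indices[i]!;
      indices[i] = indices[j]!;
      indices[j] = tmp;
    }
  }
  // Assumption: the test count is rounded up; the rest goes to training.
  const nTest = Math.ceil(nSamples * testSize);
  const nTrain = nSamples - nTest;
  return {
    trainIndex: indices.slice(0, nTrain),
    testIndex: indices.slice(nTrain),
  };
}

const demoSplit = splitIndices(10, 0.2, true, 42);
console.log(demoSplit.trainIndex.length, demoSplit.testIndex.length); // 8 2
```

Calling splitIndices twice with the same seed returns the same partition, which is the behavior randomState is meant to provide. With shuffle set to false, the last rows simply become the test set.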

KFold

K-Fold cross-validation iterator. Splits data into k consecutive folds. Each fold is used once as test set while the remaining k−1 folds form the training set.
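The fold layout described above can be sketched on plain index arrays. This is an illustration under stated assumptions rather than deepbox's code: kFoldIndices is a hypothetical helper, shuffling is omitted, and when the sample count is not divisible by k the sketch gives the first folds one extra sample each.

```typescript
// Hypothetical helper for illustration; not part of deepbox.
interface Fold {
  trainIndex: number[];
  testIndex: number[];
}

function kFoldIndices(nSamples: number, nSplits: number): Fold[] {
  if (!Number.isInteger(nSplits) || nSplits < 2 || nSplits > nSamples) {
    throw new RangeError("nSplits must be an integer between 2 and nSamples");
  }
  const base = Math.floor(nSamples / nSplits);
  // Assumption: the first `extra` folds each receive one additional sample.
  const extra = nSamples % nSplits;
  const folds: Fold[] = [];
  let start = 0;
  for (let fold = 0; fold < nSplits; fold++) {
    const stop = start + base + (fold < extra ? 1 : 0);
    const trainIndex: number[] = [];
    const testIndex: number[] = [];
    for (let i = 0; i < nSamples; i++) {
      // Indices in [start, stop) form this fold's test set.
      (i >= start && i < stop ? testIndex : trainIndex).push(i);
    }
    folds.push({ trainIndex, testIndex });
    start = stop;
  }
  return folds;
}

for (const { trainIndex, testIndex } of kFoldIndices(10, 3)) {
  console.log(`train=[${trainIndex.join(",")}] test=[${testIndex.join(",")}]`);
}
// train=[4,5,6,7,8,9] test=[0,1,2,3]
// train=[0,1,2,3,7,8,9] test=[4,5,6]
// train=[0,1,2,3,4,5,6] test=[7,8,9]
```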

StratifiedKFold

K-Fold variant that preserves the class distribution: each fold contains approximately the same percentage of each class as the complete dataset.
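One simple way to obtain class-balanced folds is to deal the samples of each class across the folds in turn. The sketch below does this on plain label arrays. stratifiedKFoldIndices is a hypothetical helper, and the round-robin assignment is an assumption chosen for clarity; deepbox may assign samples differently.

```typescript
// Hypothetical helper for illustration; not part of deepbox.
interface Fold {
  trainIndex: number[];
  testIndex: number[];
}

function stratifiedKFoldIndices(labels: number[], nSplits: number): Fold[] {
  if (!Number.isInteger(nSplits) || nSplits < 2 || nSplits > labels.length) {
    throw new RangeError("nSplits must be an integer between 2 and the sample count");
  }
  // Group sample indices by class label, preserving their original order.
  const byClass = new Map<number, number[]>();
  labels.forEach((label, i) => {
    const members = byClass.get(label);
    if (members) members.push(i);
    else byClass.set(label, [i]);
  });

  // Deal each class's samples across the folds round-robin. The running
  // offset continues from one class to the next so that leftover samples
  // do not all pile up in fold 0.
  const foldOf = new Array<number>(labels.length).fill(0);
  let offset = 0;
  byClass.forEach((members) => {
    members.forEach((sampleIndex, position) => {
      foldOf[sampleIndex] = (offset + position) % nSplits;
    });
    offset += members.length;
  });

  const folds: Fold[] = [];
  for (let fold = 0; fold < nSplits; fold++) {
    const trainIndex: number[] = [];
    const testIndex: number[] = [];
    foldOf.forEach((assigned, i) => {
      (assigned === fold ? testIndex : trainIndex).push(i);
    });
    folds.push({ trainIndex, testIndex });
  }
  return folds;
}

// Five samples of class 0 followed by five of class 1.
const demoLabels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1];
for (const { testIndex } of stratifiedKFoldIndices(demoLabels, 3)) {
  console.log(testIndex.join(","));
}
// 0,3,6,9
// 1,4,7
// 2,5,8
```

Because five samples per class cannot be divided evenly by three, the per-fold class counts differ by at most one, which is what "approximately the same percentage" means in practice.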

LeaveOneOut

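Leave-One-Out cross-validation. Each sample is used exactly once as a single-sample test set, with all remaining samples forming the training set. Equivalent to KFold with nSplits set to n, where n is the number of samples. Computationally expensive on large datasets, since it requires n train/test rounds.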

LeavePOut

Leave-P-Out cross-validation. Every possible subset of p samples is used once as the test set, which yields C(n, p) splits for n samples. Generalizes LeaveOneOut, which is the case p = 1.
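The number of splits grows combinatorially with p, which a small enumeration makes visible. The following sketch lists every size-p test subset of the row indices; leavePOutIndices is a hypothetical helper written for this illustration and is not deepbox's API.

```typescript
// Hypothetical helper for illustration; not part of deepbox.
interface Fold {
  trainIndex: number[];
  testIndex: number[];
}

function leavePOutIndices(nSamples: number, p: number): Fold[] {
  if (!Number.isInteger(p) || p < 1 || p >= nSamples) {
    throw new RangeError("p must be an integer with 1 <= p < nSamples");
  }
  const folds: Fold[] = [];
  const chosen: number[] = [];

  // Depth-first enumeration of all size-p index subsets in lexicographic order.
  const extend = (next: number): void => {
    if (chosen.length === p) {
      const held = new Set(chosen);
      const trainIndex: number[] = [];
      for (let i = 0; i < nSamples; i++) {
        if (!held.has(i)) trainIndex.push(i);
      }
      folds.push({ trainIndex, testIndex: chosen.slice() });
      return;
    }
    // Stop early enough that the remaining slots can still be filled.
    const slotsLeft = p - chosen.length;
    for (let i = next; i <= nSamples - slotsLeft; i++) {
      chosen.push(i);
      extend(i + 1);
      chosen.pop();
    }
  };
  extend(0);
  return folds;
}

console.log(leavePOutIndices(4, 2).length); // 6, i.e. C(4, 2)
console.log(leavePOutIndices(4, 1).length); // 4, the leave-one-out splits
```

With p = 1 the enumeration produces one single-sample test set per row, which is the leave-one-out scheme.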

GroupKFold

K-Fold variant that ensures no group appears in both the training and test sets of the same split. Useful when samples from the same group (e.g., the same patient) are correlated and must not be divided between training and test data.
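The group constraint can be illustrated by assigning whole groups, rather than individual samples, to folds. In the sketch below, groupKFoldIndices is a hypothetical helper, and the greedy balancing rule (largest groups first, each placed into the currently smallest fold) is an assumption for illustration; deepbox's actual assignment strategy may differ.

```typescript
// Hypothetical helper for illustration; not part of deepbox.
interface Fold {
  trainIndex: number[];
  testIndex: number[];
}

function groupKFoldIndices(groups: string[], nSplits: number): Fold[] {
  // Count the samples in each group.
  const sizes = new Map<string, number>();
  groups.forEach((g) => sizes.set(g, (sizes.get(g) ?? 0) + 1));
  if (!Number.isInteger(nSplits) || nSplits < 2 || nSplits > sizes.size) {
    throw new RangeError("nSplits must be an integer between 2 and the number of groups");
  }

  // Assumption: balance folds greedily, placing the largest groups first,
  // each into whichever fold currently holds the fewest samples.
  const ordered = Array.from(sizes.entries()).sort((a, b) => b[1] - a[1]);
  const foldSizes = new Array<number>(nSplits).fill(0);
  const foldOfGroup = new Map<string, number>();
  ordered.forEach(([group, size]) => {
    let lightest = 0;
    for (let f = 1; f < nSplits; f++) {
      if (foldSizes[f]! < foldSizes[lightest]!) lightest = f;
    }
    foldSizes[lightest] = foldSizes[lightest]! + size;
    foldOfGroup.set(group, lightest);
  });

  // A sample is tested in the fold that its group was assigned to.
  const folds: Fold[] = [];
  for (let fold = 0; fold < nSplits; fold++) {
    const trainIndex: number[] = [];
    const testIndex: number[] = [];
    groups.forEach((g, i) => {
      (foldOfGroup.get(g) === fold ? testIndex : trainIndex).push(i);
    });
    folds.push({ trainIndex, testIndex });
  }
  return folds;
}

// Eight samples from four patients of different sizes.
const demoGroups = ["p1", "p1", "p1", "p2", "p2", "p3", "p3", "p4"];
for (const { testIndex } of groupKFoldIndices(demoGroups, 2)) {
  console.log(testIndex.join(","));
}
// 0,1,2,7
// 3,4,5,6
```

Every sample of a given patient lands in the same test fold, so no patient contributes to both sides of any split.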

splitting.ts

import { trainTestSplit, KFold, StratifiedKFold } from "deepbox/preprocess";
import { tensor } from "deepbox/ndarray";

const X = tensor([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10], [11, 12], [13, 14], [15, 16], [17, 18], [19, 20]]);
const y = tensor([0, 0, 0, 0, 0, 1, 1, 1, 1, 1]);

// Simple train/test split
const [XTrain, XTest, yTrain, yTest] = trainTestSplit(X, y, {
  testSize: 0.2,
  randomState: 42,
});

// K-Fold cross-validation
const kf = new KFold({ nSplits: 5, shuffle: true, randomState: 42 });
for (const { trainIndex, testIndex } of kf.split(X)) {
  // trainIndex and testIndex are arrays of indices
}

// Stratified K-Fold (preserves class distribution)
const skf = new StratifiedKFold({ nSplits: 3 });
for (const { trainIndex, testIndex } of skf.split(X, y)) {
  // Each fold has same class ratio as the full dataset
}
