deepbox/nn

Attention & Transformer

Attention mechanisms and Transformer building blocks for sequence-to-sequence tasks, NLP, and beyond.

MultiheadAttention

extends Module

Multi-head scaled dot-product attention. Splits queries, keys, and values into multiple heads, applies attention to each head in parallel, then concatenates the per-head outputs. Core building block of Transformers.
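
The head split and merge are simple reshapes along the embedding dimension. Below is a minimal sketch of that bookkeeping for embedDim=512 and numHeads=8, using hypothetical splitHeads/mergeHeads helpers (not part of deepbox); real implementations additionally apply learned linear projections to Q, K, and V and to the concatenated output.

const embedDim = 512;
const numHeads = 8;
const headDim = embedDim / numHeads; // 512 / 8 = 64 dims per head

// Split one position's 512-dim embedding into 8 contiguous 64-dim
// slices, one per head.
function splitHeads(x: number[]): number[][] {
  return Array.from({ length: numHeads }, (_, h) =>
    x.slice(h * headDim, (h + 1) * headDim),
  );
}

// After attention runs on each head independently, concatenate the 8
// 64-dim head outputs back into a single 512-dim vector.
function mergeHeads(heads: number[][]): number[] {
  return heads.flat();
}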

TransformerEncoderLayer

extends Module

Single Transformer encoder layer: multi-head self-attention → add & norm → feedforward → add & norm. Stack multiple layers to form a full Transformer encoder.
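
The dataflow through one layer can be written as two residual sublayers. The sketch below is illustrative only: selfAttn, ff, norm1, and norm2 are hypothetical stand-ins for deepbox's internal sublayers, passed in as parameters so the add & norm structure is explicit.

type Seq = number[][]; // (seqLen, dModel)

// Residual connection: elementwise sum of two (seqLen, dModel) matrices.
function add(a: Seq, b: Seq): Seq {
  return a.map((row, i) => row.map((v, j) => v + b[i][j]));
}

// One post-norm encoder layer, with its sublayers passed in.
function encoderLayer(
  x: Seq,
  selfAttn: (x: Seq) => Seq, // multi-head self-attention
  norm1: (x: Seq) => Seq,    // first layer norm
  ff: (x: Seq) => Seq,       // position-wise feedforward
  norm2: (x: Seq) => Seq,    // second layer norm
): Seq {
  const h = norm1(add(x, selfAttn(x))); // self-attention → add & norm
  return norm2(add(h, ff(h)));          // feedforward → add & norm
}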

Scaled Dot-Product Attention

Attention(Q,K,V) = softmax(QKᵀ / √dₖ) · V

Where:

  • dₖ = key dimension; scaling by 1/√dₖ keeps the dot-product magnitudes from growing with dₖ, which stabilizes the softmax
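
A minimal, self-contained sketch of the formula on plain number[][] matrices (rows are sequence positions). All helper names here are illustrative, not deepbox API.

function matmul(a: number[][], b: number[][]): number[][] {
  return a.map((row) =>
    b[0].map((_, j) => row.reduce((sum, v, k) => sum + v * b[k][j], 0)),
  );
}

function transpose(m: number[][]): number[][] {
  return m[0].map((_, j) => m.map((row) => row[j]));
}

function softmaxRows(m: number[][]): number[][] {
  return m.map((row) => {
    const max = Math.max(...row); // subtract row max for numerical stability
    const exps = row.map((v) => Math.exp(v - max));
    const sum = exps.reduce((a, b) => a + b, 0);
    return exps.map((v) => v / sum);
  });
}

// Attention(Q, K, V) = softmax(QKᵀ / √dₖ) · V
function scaledDotProductAttention(
  q: number[][], k: number[][], v: number[][],
): number[][] {
  const dk = k[0].length;
  const scores = matmul(q, transpose(k)).map((row) =>
    row.map((s) => s / Math.sqrt(dk)),
  );
  return matmul(softmaxRows(scores), v);
}

// Example: two positions, dₖ = 2.
const out = scaledDotProductAttention(
  [[1, 0], [0, 1]], // Q
  [[1, 0], [0, 1]], // K
  [[1, 2], [3, 4]], // V
); // each output row is a softmax-weighted mix of V's rows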
attention.ts

import { MultiheadAttention, TransformerEncoderLayer } from "deepbox/nn";
import { tensor } from "deepbox/ndarray";

// Multi-head attention
const mha = new MultiheadAttention(512, 8); // embedDim=512, numHeads=8
// Q, K, V all shape: (batch, seqLen, embedDim)

// Transformer encoder layer
const encoder = new TransformerEncoderLayer({
  dModel: 512,
  nHead: 8,
  dimFeedforward: 2048,
  dropout: 0.1,
});