Attention & Transformer Layers
The Transformer architecture, based on self-attention, revolutionized NLP and now dominates many ML domains. This example demonstrates Deepbox's two attention layers: MultiheadAttention (computes scaled dot-product attention across multiple heads, allowing the model to attend to different positions simultaneously) and TransformerEncoderLayer (a full encoder block combining self-attention, feedforward network, layer normalization, and residual connections). You create both layers, pass sequence tensors through them, and inspect the output shapes. The example explains the query/key/value paradigm and how multi-head attention enables the model to learn different types of relationships.
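To make the query/key/value mechanics concrete, here is a minimal scaled dot-product attention on plain arrays. This is an illustration only, not Deepbox's implementation: MultiheadAttention additionally applies learned Q/K/V projections and runs this computation once per head.

```typescript
// Scaled dot-product attention: attention(Q, K, V) = softmax(Q K^T / sqrt(dK)) V
type Mat = number[][];

function matmul(a: Mat, b: Mat): Mat {
  return a.map((row) =>
    b[0].map((_, j) => row.reduce((s, x, k) => s + x * b[k][j], 0))
  );
}

function transpose(m: Mat): Mat {
  return m[0].map((_, j) => m.map((row) => row[j]));
}

// Softmax over each row, with max-subtraction for numerical stability.
function softmaxRows(m: Mat): Mat {
  return m.map((row) => {
    const max = Math.max(...row);
    const exps = row.map((x) => Math.exp(x - max));
    const sum = exps.reduce((s, x) => s + x, 0);
    return exps.map((x) => x / sum);
  });
}

function attention(q: Mat, k: Mat, v: Mat): Mat {
  const dK = k[0].length;
  const scores = matmul(q, transpose(k)).map((row) =>
    row.map((x) => x / Math.sqrt(dK)) // scale by sqrt(dK) to keep logits tame
  );
  return matmul(softmaxRows(scores), v);
}

// Tiny example: seq=2, dK=2. Query 0 matches key 0, so it weights V's row 0 more.
const out = attention(
  [[1, 0], [0, 1]],   // Q
  [[1, 0], [0, 1]],   // K
  [[10, 0], [0, 10]], // V
);
console.log(out); // each output row is a convex mix of V's rows (sums to 10)
```

Because the softmax weights sum to 1 per query, each output row is a weighted average of the value rows — this is the sense in which attention "mixes" information across positions.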
Deepbox Modules Used
- deepbox/ndarray
- deepbox/nn

What You Will Learn
- MultiheadAttention splits dModel into nHeads — each head learns different patterns
- Query/Key/Value are projections of the input — self-attention uses the same input for all three
- TransformerEncoderLayer = SelfAttention + FFN + LayerNorm + Residual connections
- Attention output preserves sequence length and model dimension
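The first and last bullets are simple arithmetic worth checking: with dModel=64 and nHeads=8, each head works in a 64/8 = 8-dimensional subspace, and concatenating the head outputs restores dModel, which is why the output shape matches the input shape.

```typescript
// Head-dimension bookkeeping for MultiheadAttention(dModel, nHeads).
const dModel = 64;
const nHeads = 8;

// dModel must divide evenly across heads.
if (dModel % nHeads !== 0) throw new Error("dModel must be divisible by nHeads");

const headDim = dModel / nHeads;      // 8: per-head query/key/value width
const concatDim = nHeads * headDim;   // 64: concatenated heads restore dModel
console.log({ headDim, concatDim });  // { headDim: 8, concatDim: 64 }
```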
Source Code
29-attention-transformer/index.ts
```typescript
import { randn } from "deepbox/ndarray";
import { MultiheadAttention, TransformerEncoderLayer } from "deepbox/nn";

console.log("=== Attention & Transformer ===\n");

// MultiheadAttention: dModel=64, nHeads=8
const mha = new MultiheadAttention(64, 8);
const q = randn([2, 10, 64]); // [batch, seq, dModel]
const k = randn([2, 10, 64]);
const v = randn([2, 10, 64]);
const attnOut = mha.forward(q, k, v);
console.log("MHA input:", q.shape);
console.log("MHA output:", attnOut.shape); // [2, 10, 64]

// TransformerEncoderLayer
const encoder = new TransformerEncoderLayer(64, 8, { dimFeedforward: 256 });
const src = randn([2, 10, 64]);
const encOut = encoder.forward(src);
console.log("\nEncoder input:", src.shape);
console.log("Encoder output:", encOut.shape); // [2, 10, 64]
```

Console Output
```
$ npx tsx 29-attention-transformer/index.ts
=== Attention & Transformer ===

MHA input: [2, 10, 64]
MHA output: [2, 10, 64]

Encoder input: [2, 10, 64]
Encoder output: [2, 10, 64]
```
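The TransformerEncoderLayer composition (SelfAttention + FFN + LayerNorm + residuals) can be sketched on a single token vector. The sublayers below are toy stand-ins with no learned weights, and the post-norm ordering (norm applied after each residual add, as in the original Transformer paper) is an assumption — this example does not show Deepbox's internal ordering.

```typescript
type Vec = number[];

// Normalize to zero mean and unit variance (scale/shift parameters omitted).
function layerNorm(x: Vec, eps = 1e-5): Vec {
  const mean = x.reduce((s, v) => s + v, 0) / x.length;
  const variance = x.reduce((s, v) => s + (v - mean) ** 2, 0) / x.length;
  return x.map((v) => (v - mean) / Math.sqrt(variance + eps));
}

function add(a: Vec, b: Vec): Vec {
  return a.map((v, i) => v + b[i]);
}

// Toy stand-ins for the two sublayers (the real ones have learned weights):
const selfAttention = (x: Vec): Vec => x.map((v) => 0.5 * v);
const feedForward = (x: Vec): Vec => x.map((v) => Math.max(0, v)); // ReLU

// Post-norm encoder block: residual + norm around each sublayer.
function encoderLayer(x: Vec): Vec {
  const a = layerNorm(add(x, selfAttention(x))); // attention sublayer
  return layerNorm(add(a, feedForward(a)));      // feedforward sublayer
}

const y = encoderLayer([1, -2, 3, -4]);
console.log(y.length); // 4: the model dimension is preserved
```

Because every sublayer maps a d-dimensional vector to a d-dimensional vector, residual addition is always well-defined and the layer preserves shape — the same property you saw in the [2, 10, 64] → [2, 10, 64] console output above.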