Deep Learning
Cheat Sheets

Quick-reference cards for PyTorch, key formulas, activation functions, common architectures, and training tricks. Print them, bookmark them, use them.

🔥 PyTorch Essentials

The most-used PyTorch patterns for training neural networks

Create tensor: torch.tensor([1,2,3])
Zeros/Ones: torch.zeros(3,4)
Random: torch.randn(3,4)
Move to GPU: tensor.cuda() / .to('cuda')
Gradient off: torch.no_grad()
Zero grads: optimizer.zero_grad()
Backprop: loss.backward()
Update step: optimizer.step()
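
The patterns above slot together into one standard training loop. A minimal sketch, assuming PyTorch is installed; the toy data, shapes, and hyperparameters below are made up for illustration:

```python
import torch
from torch import nn

torch.manual_seed(0)  # reproducible toy run

# Toy regression data (shapes and target weights are illustrative only)
X = torch.randn(64, 3)
true_w = torch.tensor([[1.0], [-2.0], [0.5]])
y = X @ true_w + 0.1 * torch.randn(64, 1)

model = nn.Linear(3, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()

for epoch in range(200):
    optimizer.zero_grad()          # zero grads
    loss = loss_fn(model(X), y)    # forward pass + loss
    loss.backward()                # backprop
    optimizer.step()               # update step

with torch.no_grad():              # gradient off for evaluation
    final_loss = loss_fn(model(X), y).item()
```

The order matters: zero the gradients before `backward()`, and call `step()` only after the gradients for the current batch exist.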

Activation Functions

Formulas and characteristics of common activations

ReLU: max(0, x)
Leaky ReLU: max(αx, x), α=0.01
Sigmoid: 1 / (1 + e^−x)
Tanh: (e^x − e^−x)/(e^x + e^−x)
Softmax: e^xᵢ / Σⱼ e^xⱼ
GELU: x·Φ(x)
Swish: x·sigmoid(x)
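
The scalar formulas above, written out in plain Python. A sketch for intuition, not the optimized `torch.nn` implementations:

```python
import math

def relu(x):
    return max(0.0, x)

def leaky_relu(x, alpha=0.01):
    return max(alpha * x, x)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gelu(x):
    # x · Φ(x), where Φ is the standard normal CDF, via erf
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def swish(x):
    return x * sigmoid(x)
```
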

📉 Loss Functions

When to use which loss function

MSE (regression): mean((ŷ − y)²)
MAE (regression): mean(|ŷ − y|)
Cross-Entropy (cls): −Σ y·log(ŷ)
Binary CE (2-class): nn.BCELoss()
Huber (robust regr): nn.HuberLoss()
KL Divergence: Σ p·log(p/q)
Contrastive: used in CLIP, SimCLR
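
Hand-rolled versions of the most common formulas above, assuming NumPy; in practice you would use `nn.MSELoss`, `nn.CrossEntropyLoss`, etc.:

```python
import numpy as np

def mse(y_hat, y):
    return float(np.mean((y_hat - y) ** 2))

def mae(y_hat, y):
    return float(np.mean(np.abs(y_hat - y)))

def cross_entropy(y_hat, y, eps=1e-12):
    # y: one-hot targets, y_hat: predicted probabilities; eps avoids log(0)
    return float(-np.sum(y * np.log(y_hat + eps)))

def kl_divergence(p, q, eps=1e-12):
    return float(np.sum(p * np.log((p + eps) / (q + eps))))
```
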

🏗️ Common Architectures

What to use for which problem

Image classification: ResNet, EfficientNet
Object detection: YOLO, Faster R-CNN
Segmentation: U-Net, Mask R-CNN
Text classification: BERT, DistilBERT
Text generation: GPT-2, LLaMA
Seq2Seq / translate: T5, mBART
Image generation: Stable Diffusion
Audio: Whisper, WaveNet

🎛️ Optimizers Quick Ref

Common optimizers and when to use them

SGD: optim.SGD(lr=0.01)
SGD + Momentum: momentum=0.9
Adam: optim.Adam(lr=1e-3)
AdamW: weight_decay=0.01
RMSprop: optim.RMSprop
Best default: AdamW, lr=3e-4
LR scheduler: CosineAnnealingLR
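
A minimal sketch of the suggested default (AdamW at lr=3e-4 plus cosine annealing), assuming PyTorch; the stand-in model and `T_max` are made-up choices:

```python
import torch
from torch import nn

model = nn.Linear(10, 2)  # stand-in model for illustration
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

lrs = []
for step in range(100):
    # ... forward pass, loss.backward() would go here ...
    optimizer.step()       # no gradients in this sketch, so this is a no-op
    scheduler.step()       # decay the LR along a cosine curve
    lrs.append(optimizer.param_groups[0]["lr"])
```

The LR starts at 3e-4 and glides to ~0 by step `T_max`; call `scheduler.step()` once per epoch (or per step, if `T_max` is in steps).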

🩺 Training Diagnostics

How to debug your training runs

Loss not decreasing: ↓ LR or check data
Loss oscillating: ↓ LR
Loss → NaN: ↓ LR or clip grads
Overfitting: more data or dropout
Underfitting: bigger model or ↑ LR
Slow training: increase batch size
GPU util < 70%: increase batch size
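
One fix from the table above made concrete: clipping gradients before the update step so a spiky batch can't blow the loss up to NaN. A sketch assuming PyTorch, with deliberately extreme made-up inputs:

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(8, 4) * 1000.0           # deliberately extreme inputs
loss = model(x).pow(2).mean()

optimizer.zero_grad()
loss.backward()
# Rescale gradients so their total norm is at most 1.0; returns the pre-clip norm
total_norm = nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```

`clip_grad_norm_` must run after `backward()` and before `step()`, since it modifies the `.grad` tensors in place.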

The Math You Actually Need

Neuron Output
z = Wx + b
a = activation(z)
W = weights, x = inputs, b = bias. a = activated output.
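
The two lines above, computed literally for a single neuron with ReLU; the weights, inputs, and bias are made-up numbers:

```python
import numpy as np

W = np.array([0.5, -1.0, 2.0])   # weights
x = np.array([1.0, 2.0, 0.5])    # inputs
b = 0.1                          # bias

z = W @ x + b                    # z = Wx + b  →  0.5 − 2.0 + 1.0 + 0.1 = −0.4
a = max(0.0, z)                  # activation(z), here ReLU  →  0.0
```
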
Mean Squared Error
L = (1/n) Σ(yᵢ − ŷᵢ)²
Average squared difference between predictions and targets. Used for regression.
Gradient Descent Update
θ = θ − α · ∂L/∂θ
θ = parameters, α = learning rate, ∂L/∂θ = gradient of loss with respect to parameter.
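
The update rule in action on a toy one-parameter loss, L(θ) = (θ − 3)², whose gradient is 2(θ − 3) and whose minimum sits at θ = 3:

```python
theta = 0.0                       # initial parameter
alpha = 0.1                       # learning rate α

for _ in range(100):
    grad = 2.0 * (theta - 3.0)    # ∂L/∂θ
    theta = theta - alpha * grad  # θ ← θ − α·∂L/∂θ
```

Each step moves θ downhill by a fraction of the gradient; after enough steps it converges to 3.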
Softmax
σ(xᵢ) = e^xᵢ / Σⱼ e^xⱼ
Converts raw scores to probabilities summing to 1. Used for multi-class classification.
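
The formula as code, with the standard max-subtraction trick so large scores don't overflow `exp()` (shifting every xᵢ by the same constant leaves the result unchanged):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))    # numerically stable: largest exponent is 0
    return e / e.sum()

probs = softmax(np.array([2.0, 1.0, 0.1]))
```
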
Adam Update Rule
m = β₁m + (1−β₁)g
v = β₂v + (1−β₂)g²
m̂ = m/(1−β₁ᵗ), v̂ = v/(1−β₂ᵗ)
θ = θ − α·m̂/(√v̂ + ε)
Combines momentum (m) and adaptive learning rates (v); m̂ and v̂ are bias-corrected estimates at step t. Defaults: β₁=0.9, β₂=0.999.
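
One literal pass through the Adam equations for a single parameter at step t = 1; the gradient value g is made up:

```python
import math

beta1, beta2, alpha, eps = 0.9, 0.999, 1e-3, 1e-8
theta, m, v = 0.0, 0.0, 0.0

g = 0.5                                # gradient at step t = 1 (made-up value)
t = 1
m = beta1 * m + (1 - beta1) * g        # momentum (first moment)
v = beta2 * v + (1 - beta2) * g * g    # adaptive term (second moment)
m_hat = m / (1 - beta1 ** t)           # bias correction
v_hat = v / (1 - beta2 ** t)
theta = theta - alpha * m_hat / (math.sqrt(v_hat) + eps)
```

At t = 1 the bias correction exactly undoes the (1−β) scaling, so the first update is ≈ −α·g/|g| = −α.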
Attention Score
Attention = softmax(QKᵀ/√d)V
Q=queries, K=keys, V=values, d=dimension. The core of transformer models.
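
Scaled dot-product attention for a single head, sketched in NumPy on made-up 2-token, d=4 inputs:

```python
import numpy as np

def attention(Q, K, V):
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                            # QKᵀ/√d
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))  # stable softmax
    w = w / w.sum(axis=-1, keepdims=True)                    # rows sum to 1
    return w @ V, w                                          # weighted values

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 2, 4))   # made-up queries, keys, values
out, weights = attention(Q, K, V)
```

Each output row is a convex combination of the value rows, weighted by how strongly that query matches each key.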

Want to understand where these formulas come from?

Start from the Fundamentals →

Printable Sheets Coming Soon

Downloadable PDF versions are being prepared. Use the topic links above for now.