Deep Learning Cheat Sheets
Quick-reference cards for PyTorch, key formulas, activation functions, common architectures, and training tricks. Print them, bookmark them, use them.
PyTorch Essentials
The most-used PyTorch patterns for training neural networks

- Create tensor: torch.tensor([1, 2, 3])
- Zeros/ones: torch.zeros(3, 4) / torch.ones(3, 4)
- Random normal: torch.randn(3, 4)
- Move to GPU: tensor.cuda() or tensor.to('cuda')
- Gradient off: torch.no_grad()
- Zero grads: optimizer.zero_grad()
- Backprop: loss.backward()
- Update step: optimizer.step()
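The last four calls above always appear in the same order inside a training loop. A minimal sketch wiring them together on a toy regression task (the data, model size, and hyperparameters here are illustrative, not from the sheet):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy regression data: y = 2x exactly (illustrative only)
x = torch.randn(64, 1)
y = 2 * x

model = nn.Linear(1, 1)
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for epoch in range(100):
    optimizer.zero_grad()      # zero grads from the previous step
    pred = model(x)            # forward pass
    loss = loss_fn(pred, y)    # compute loss
    loss.backward()            # backprop fills .grad on each parameter
    optimizer.step()           # update step

# After training, model.weight should be close to 2.
```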
Activation Functions
Formulas and characteristics of common activations

- ReLU: max(0, x)
- Leaky ReLU: max(αx, x), α = 0.01
- Sigmoid: 1 / (1 + e^(−x))
- Tanh: (e^x − e^(−x)) / (e^x + e^(−x))
- Softmax: e^(xᵢ) / Σⱼ e^(xⱼ)
- GELU: x·Φ(x)
- Swish: x·sigmoid(x)
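Each of these formulas translates directly into one line of code. A plain-Python sketch, using the exact erf-based form of Φ for GELU:

```python
import math

def relu(x):
    return max(0.0, x)

def leaky_relu(x, alpha=0.01):
    return max(alpha * x, x)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    return math.tanh(x)

def gelu(x):
    # x * Phi(x), where Phi is the standard normal CDF
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def swish(x):
    return x * sigmoid(x)
```

Note how Leaky ReLU differs from ReLU only below zero: for example, leaky_relu(-100.0) is -1.0 rather than 0.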
Loss Functions
When to use which loss function

- MSE (regression): mean((ŷ − y)²)
- MAE (regression): mean(|ŷ − y|)
- Cross-entropy (classification): −Σ y·log(ŷ)
- Binary cross-entropy (2-class): nn.BCELoss()
- Huber (robust regression): nn.HuberLoss()
- KL divergence: Σ p·log(p/q)
- Contrastive: used in CLIP, SimCLR
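The regression and classification losses above are short enough to write out directly. A plain-Python sketch (cross-entropy here assumes a one-hot target and a probability vector, which is the textbook form rather than PyTorch's logit-based API):

```python
import math

def mse(y_true, y_pred):
    # mean((y - y_hat)^2)
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    # mean(|y - y_hat|)
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def cross_entropy(y_true, y_pred):
    # -sum(y * log(y_hat)); y_true is one-hot, y_pred holds probabilities
    return -sum(t * math.log(p) for t, p in zip(y_true, y_pred) if t > 0)
```

For a correct class predicted with probability 0.5, cross-entropy gives log 2 ≈ 0.693, which matches the familiar "random guess on 2 classes" baseline.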
Common Architectures
What to use for which problem

- Image classification: ResNet, EfficientNet
- Object detection: YOLO, Faster R-CNN
- Segmentation: U-Net, Mask R-CNN
- Text classification: BERT, DistilBERT
- Text generation: GPT-2, LLaMA
- Seq2seq / translation: T5, mBART
- Image generation: Stable Diffusion
- Audio: Whisper, WaveNet
Optimizers Quick Ref
Common optimizers and when to use them

- SGD: optim.SGD(lr=0.01)
- SGD + momentum: momentum=0.9
- Adam: optim.Adam(lr=1e-3)
- AdamW: weight_decay=0.01
- RMSprop: optim.RMSprop
- Best default: AdamW, lr=3e-4
- LR scheduler: CosineAnnealingLR
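To see what momentum=0.9 actually does, here is one common formulation of the momentum update written out for a single scalar parameter (PyTorch's internal form differs slightly in where the learning rate is applied, so treat this as a sketch of the idea, not of the library's exact code):

```python
def sgd_momentum_step(theta, grad, velocity, lr=0.01, momentum=0.9):
    # velocity accumulates an exponentially decaying sum of past gradients,
    # which damps oscillation and speeds up travel along consistent directions
    velocity = momentum * velocity - lr * grad
    return theta + velocity, velocity

# Minimize f(theta) = theta^2 (gradient 2*theta) starting from theta = 5
theta, v = 5.0, 0.0
for _ in range(300):
    theta, v = sgd_momentum_step(theta, 2 * theta, v, lr=0.1, momentum=0.9)
# theta ends up very close to the minimum at 0
```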
Training Diagnostics
How to debug your training runs

- Loss not decreasing: ↓ LR or check the data
- Loss oscillating: ↓ LR
- Loss → NaN: ↓ LR or clip gradients
- Overfitting: more data or dropout
- Underfitting: bigger model or ↑ LR
- Slow training: increase batch size
- GPU util < 70%: increase batch size

Key Formulas
The Math You Actually Need
Neuron Output
z = Wx + b
a = activation(z)
W = weights, x = inputs, b = bias, a = activated output.
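A single neuron is just a dot product plus a bias, fed through an activation. A minimal sketch using ReLU as the activation (the example numbers are illustrative):

```python
def neuron(x, w, b):
    # z = Wx + b: weighted sum of inputs plus bias
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    # a = activation(z), here ReLU
    return max(0.0, z)

# e.g. x = [1.0, 2.0], w = [0.5, -0.25], b = 0.1
# z = 0.5 - 0.5 + 0.1 = 0.1, and ReLU leaves it unchanged: a = 0.1
```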
Mean Squared Error
L = (1/n) Σ(yᵢ − ŷᵢ)²
Average squared difference between predictions and targets. Used for regression.
Gradient Descent Update
θ = θ − α · ∂L/∂θ
θ = parameters, α = learning rate, ∂L/∂θ = gradient of the loss with respect to the parameters.
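The update rule applied in a loop is the whole algorithm. A sketch minimizing the toy loss L(θ) = (θ − 3)², whose gradient is 2(θ − 3), so the iterates should converge to θ = 3:

```python
def gradient_descent(grad, theta, lr=0.1, steps=100):
    # repeatedly apply the update theta <- theta - alpha * dL/dtheta
    for _ in range(steps):
        theta = theta - lr * grad(theta)
    return theta

# minimize L(theta) = (theta - 3)^2, gradient 2*(theta - 3)
theta = gradient_descent(lambda t: 2 * (t - 3.0), theta=0.0)
```

Each step multiplies the distance to the minimum by (1 − 2α), so with α = 0.1 the error shrinks by 20% per step.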
Softmax
σ(xᵢ) = e^xᵢ / Σⱼ e^xⱼ
Converts raw scores to probabilities summing to 1. Used for multi-class
classification.
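A direct implementation of the formula, with the standard subtract-the-max trick so large scores don't overflow the exponential (this shift cancels in the ratio and leaves the result unchanged):

```python
import math

def softmax(xs):
    m = max(xs)                              # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]
```

The outputs are always positive, sum to 1, and preserve the ordering of the input scores.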
Adam Update Rule
m = β₁m + (1−β₁)g
v = β₂v + (1−β₂)g²
θ = θ − α·m̂/(√v̂ + ε)
Combines momentum (m) and adaptive learning rates (v). m̂ and v̂ are the bias-corrected moments, m̂ = m/(1−β₁ᵗ), v̂ = v/(1−β₂ᵗ). Defaults: β₁=0.9, β₂=0.999.
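The three update lines written out for a single scalar parameter, including the bias correction that the hatted symbols denote (a sketch of the textbook rule, not of any library's internals; t is the 1-based step count):

```python
import math

def adam_step(theta, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * g          # first moment: momentum on the gradient
    v = b2 * v + (1 - b2) * g * g      # second moment: adaptive per-parameter scale
    m_hat = m / (1 - b1 ** t)          # bias correction for the zero init of m
    v_hat = v / (1 - b2 ** t)          # bias correction for the zero init of v
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v
```

On the very first step the corrections make m̂ ≈ g and v̂ ≈ g², so the update is roughly −lr·sign(g) regardless of the gradient's magnitude.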
Attention Score
Attention = softmax(QKᵀ/√d)V
Q = queries, K = keys, V = values, d = key dimension. The core of transformer models.
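The formula in code, for Q, K, V given as lists of row vectors. This is an unbatched pure-Python sketch to make the data flow visible; real implementations do the same thing with batched matrix multiplies:

```python
import math

def attention(Q, K, V):
    # softmax(Q K^T / sqrt(d)) V, one query row at a time
    d = len(K[0])
    out = []
    for q in Q:
        # scaled dot-product score of this query against every key
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        m = max(scores)                          # stable softmax
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]
        # output = attention-weighted average of the value rows
        out.append([sum(w * vrow[j] for w, vrow in zip(weights, V))
                    for j in range(len(V[0]))])
    return out
```

With identical keys the weights are uniform and the output is just the mean of the value rows; with one strongly matching key the output approaches that key's value row.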
Want to understand where these formulas come from?
Start from the Fundamentals →

Printable Sheets Coming Soon
Downloadable PDF versions are being prepared. Use the topic links above for now.