Deep Learning Cheat Sheets
Quick-reference cards for PyTorch, key formulas, activation functions, common architectures, and training tricks. Print them, bookmark them, use them.
PyTorch Essentials
The most-used PyTorch patterns for training neural networks

- Create tensor: torch.tensor([1, 2, 3])
- Zeros/ones: torch.zeros(3, 4) / torch.ones(3, 4)
- Random normal: torch.randn(3, 4)
- Move to GPU: tensor.cuda() or tensor.to('cuda')
- Gradient off: torch.no_grad()
- Zero grads: optimizer.zero_grad()
- Backprop: loss.backward()
- Update step: optimizer.step()
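The last four calls above always appear in the same order inside a training loop. A minimal sketch wiring them together on a toy regression task (the data, model size, and hyperparameters here are illustrative, not from the sheet):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy regression data: y = 2x exactly (illustrative only)
x = torch.randn(64, 1)
y = 2 * x

model = nn.Linear(1, 1)
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for epoch in range(100):
    optimizer.zero_grad()      # zero grads from the previous step
    pred = model(x)            # forward pass
    loss = loss_fn(pred, y)    # compute loss
    loss.backward()            # backprop fills .grad on each parameter
    optimizer.step()           # update step

# After training, model.weight should be close to 2.
```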
Activation Functions
Formulas and characteristics of common activations

- ReLU: max(0, x)
- Leaky ReLU: max(αx, x), α = 0.01
- Sigmoid: 1 / (1 + e^(−x))
- Tanh: (e^x − e^(−x)) / (e^x + e^(−x))
- Softmax: e^(xᵢ) / Σⱼ e^(xⱼ)
- GELU: x·Φ(x)
- Swish: x·sigmoid(x)
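Each of these formulas translates directly into one line of code. A plain-Python sketch, using the exact erf-based form of Φ for GELU:

```python
import math

def relu(x):
    return max(0.0, x)

def leaky_relu(x, alpha=0.01):
    return max(alpha * x, x)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    return math.tanh(x)

def gelu(x):
    # x * Phi(x), where Phi is the standard normal CDF
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def swish(x):
    return x * sigmoid(x)
```

Note how Leaky ReLU differs from ReLU only below zero: for example, leaky_relu(-100.0) is -1.0 rather than 0.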
Loss Functions
When to use which loss function

- MSE (regression): mean((ŷ − y)²)
- MAE (regression): mean(|ŷ − y|)
- Cross-entropy (classification): −Σ y·log(ŷ)
- Binary cross-entropy (2-class): nn.BCELoss()
- Huber (robust regression): nn.HuberLoss()
- KL divergence: Σ p·log(p/q)
- Contrastive: used in CLIP, SimCLR
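The regression and classification losses above are short enough to write out directly. A plain-Python sketch (cross-entropy here assumes a one-hot target and a probability vector, which is the textbook form rather than PyTorch's logit-based API):

```python
import math

def mse(y_true, y_pred):
    # mean((y - y_hat)^2)
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    # mean(|y - y_hat|)
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def cross_entropy(y_true, y_pred):
    # -sum(y * log(y_hat)); y_true is one-hot, y_pred holds probabilities
    return -sum(t * math.log(p) for t, p in zip(y_true, y_pred) if t > 0)
```

For a correct class predicted with probability 0.5, cross-entropy gives log 2 ≈ 0.693, which matches the familiar "random guess on 2 classes" baseline.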
Common Architectures
What to use for which problem

- Image classification: ResNet, EfficientNet
- Object detection: YOLO, Faster R-CNN
- Segmentation: U-Net, Mask R-CNN
- Text classification: BERT, DistilBERT
- Text generation: GPT-2, LLaMA
- Seq2seq / translation: T5, mBART
- Image generation: Stable Diffusion
- Audio: Whisper, WaveNet
Optimizers Quick Ref
Common optimizers and when to use them

- SGD: optim.SGD(lr=0.01)
- SGD + momentum: momentum=0.9
- Adam: optim.Adam(lr=1e-3)
- AdamW: weight_decay=0.01
- RMSprop: optim.RMSprop
- Best default: AdamW, lr=3e-4
- LR scheduler: CosineAnnealingLR
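To see what momentum=0.9 actually does, here is one common formulation of the momentum update written out for a single scalar parameter (PyTorch's internal form differs slightly in where the learning rate is applied, so treat this as a sketch of the idea, not of the library's exact code):

```python
def sgd_momentum_step(theta, grad, velocity, lr=0.01, momentum=0.9):
    # velocity accumulates an exponentially decaying sum of past gradients,
    # which damps oscillation and speeds up travel along consistent directions
    velocity = momentum * velocity - lr * grad
    return theta + velocity, velocity

# Minimize f(theta) = theta^2 (gradient 2*theta) starting from theta = 5
theta, v = 5.0, 0.0
for _ in range(300):
    theta, v = sgd_momentum_step(theta, 2 * theta, v, lr=0.1, momentum=0.9)
# theta ends up very close to the minimum at 0
```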
Training Diagnostics
How to debug your training runs

- Loss not decreasing: ↓ LR or check the data
- Loss oscillating: ↓ LR
- Loss → NaN: ↓ LR or clip gradients
- Overfitting: more data or dropout
- Underfitting: bigger model or ↑ LR
- Slow training: increase batch size
- GPU util < 70%: increase batch size

Key Formulas
The Math You Actually Need
Neuron Output
z = Wx + b
a = activation(z)
W = weights, x = inputs, b = bias, a = activated output.
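A single neuron is just a dot product plus a bias, fed through an activation. A minimal sketch using ReLU as the activation (the example numbers are illustrative):

```python
def neuron(x, w, b):
    # z = Wx + b: weighted sum of inputs plus bias
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    # a = activation(z), here ReLU
    return max(0.0, z)

# e.g. x = [1.0, 2.0], w = [0.5, -0.25], b = 0.1
# z = 0.5 - 0.5 + 0.1 = 0.1, and ReLU leaves it unchanged: a = 0.1
```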
Mean Squared Error
L = (1/n) Σ(yᵢ − ŷᵢ)²
Average squared difference between predictions and targets. Used for regression.
Gradient Descent Update
θ = θ − α · ∂L/∂θ
θ = parameters, α = learning rate, ∂L/∂θ = gradient of the loss with respect to the parameters.
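The update rule applied in a loop is the whole algorithm. A sketch minimizing the toy loss L(θ) = (θ − 3)², whose gradient is 2(θ − 3), so the iterates should converge to θ = 3:

```python
def gradient_descent(grad, theta, lr=0.1, steps=100):
    # repeatedly apply the update theta <- theta - alpha * dL/dtheta
    for _ in range(steps):
        theta = theta - lr * grad(theta)
    return theta

# minimize L(theta) = (theta - 3)^2, gradient 2*(theta - 3)
theta = gradient_descent(lambda t: 2 * (t - 3.0), theta=0.0)
```

Each step multiplies the distance to the minimum by (1 − 2α), so with α = 0.1 the error shrinks by 20% per step.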
Softmax
σ(xᵢ) = e^xᵢ / Σⱼ e^xⱼ
Converts raw scores to probabilities summing to 1. Used for multi-class
classification.
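A direct implementation of the formula, with the standard subtract-the-max trick so large scores don't overflow the exponential (this shift cancels in the ratio and leaves the result unchanged):

```python
import math

def softmax(xs):
    m = max(xs)                              # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]
```

The outputs are always positive, sum to 1, and preserve the ordering of the input scores.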
Adam Update Rule
m = β₁m + (1−β₁)g
v = β₂v + (1−β₂)g²
θ = θ − α·m̂/(√v̂ + ε)
Combines momentum (m) and adaptive learning rates (v). m̂ and v̂ are the bias-corrected moments, m̂ = m/(1−β₁ᵗ), v̂ = v/(1−β₂ᵗ). Defaults: β₁=0.9, β₂=0.999.
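The three update lines written out for a single scalar parameter, including the bias correction that the hatted symbols denote (a sketch of the textbook rule, not of any library's internals; t is the 1-based step count):

```python
import math

def adam_step(theta, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * g          # first moment: momentum on the gradient
    v = b2 * v + (1 - b2) * g * g      # second moment: adaptive per-parameter scale
    m_hat = m / (1 - b1 ** t)          # bias correction for the zero init of m
    v_hat = v / (1 - b2 ** t)          # bias correction for the zero init of v
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v
```

On the very first step the corrections make m̂ ≈ g and v̂ ≈ g², so the update is roughly −lr·sign(g) regardless of the gradient's magnitude.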
Attention Score
Attention = softmax(QKᵀ/√d)V
Q = queries, K = keys, V = values, d = key dimension. The core of transformer models.
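The formula in code, for Q, K, V given as lists of row vectors. This is an unbatched pure-Python sketch to make the data flow visible; real implementations do the same thing with batched matrix multiplies:

```python
import math

def attention(Q, K, V):
    # softmax(Q K^T / sqrt(d)) V, one query row at a time
    d = len(K[0])
    out = []
    for q in Q:
        # scaled dot-product score of this query against every key
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        m = max(scores)                          # stable softmax
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]
        # output = attention-weighted average of the value rows
        out.append([sum(w * vrow[j] for w, vrow in zip(weights, V))
                    for j in range(len(V[0]))])
    return out
```

With identical keys the weights are uniform and the output is just the mean of the value rows; with one strongly matching key the output approaches that key's value row.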
Want to understand where these formulas come from?
Start from the Fundamentals →

Printable Sheets Coming Soon
Downloadable PDF versions are being prepared. Use the topic links above for now.