Deep Learning Glossary
200+ terms explained in plain English. No PhD required.
Activation Function
A mathematical function applied to a neuron's output to introduce non-linearity into the network. Without activation functions, stacking layers would be equivalent to a single linear transformation. Common examples: ReLU, Sigmoid, Tanh, GELU.
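A minimal sketch of two of these functions in NumPy (the input values are illustrative):

```python
import numpy as np

def relu(x):
    # Zeros out negatives, passes positives through unchanged
    return np.maximum(0, x)

def sigmoid(x):
    # Squashes any real value into the range (0, 1)
    return 1 / (1 + np.exp(-x))

x = np.array([-2.0, 0.0, 3.0])
print(relu(x))       # [0. 0. 3.]
print(sigmoid(0.0))  # 0.5
```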
Attention Mechanism
A way for models to dynamically focus on different parts of the input when producing each output. Instead of encoding everything into one vector, attention allows the model to "look back" at relevant parts. The foundation of transformers.
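The core computation (scaled dot-product attention, the variant used in transformers) can be sketched in a few lines; the matrices here are toy values, not learned parameters:

```python
import numpy as np

def softmax(x):
    # Row-wise softmax, stabilized by subtracting the row max
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    # Each query scores every key; the scores (after softmax)
    # weight a mix of the value vectors — the "look back" step.
    d = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d))
    return weights @ V

Q = np.array([[10.0, 0.0]])                  # one query
K = np.array([[10.0, 0.0], [0.0, 10.0]])     # two keys
V = np.array([[1.0, 0.0], [0.0, 1.0]])       # two values
print(attention(Q, K, V))  # the query attends almost entirely to the first key
```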
Autoencoder
A neural network trained to compress its input into a compact representation (encoding), then reconstruct the original from that representation (decoding). Used for dimensionality reduction, anomaly detection, and as a component of generative models.
Backpropagation
Short for "backward propagation of errors." The algorithm used to compute gradients of the loss with respect to each parameter in a neural network. It works by applying the chain rule of calculus backwards through the network's layers. This is what enables neural networks to learn from their mistakes.
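For a single neuron the chain rule can be traced by hand; the numbers below are arbitrary toy values:

```python
# Forward: prediction y = w*x + b, loss L = (y - t)**2
w, b, x, t = 2.0, 1.0, 3.0, 10.0
y = w * x + b          # 7.0
L = (y - t) ** 2       # 9.0

# Backward: apply the chain rule from the loss toward each parameter
dL_dy = 2 * (y - t)    # -6.0
dL_dw = dL_dy * x      # -18.0  (since dy/dw = x)
dL_db = dL_dy * 1      # -6.0   (since dy/db = 1)
```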
Batch Normalization
A technique that normalizes each layer's inputs to zero mean and unit variance over the current mini-batch, then applies a learnable scale and shift. This stabilizes and accelerates training, allowing higher learning rates. Typically applied after a linear transformation but before the activation function.
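The normalization step can be sketched like this (the learnable scale and shift a real layer would add are omitted for brevity; the batch values are illustrative):

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    # Normalize each feature (column) over the batch dimension (axis 0).
    # eps guards against division by zero for constant features.
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    return (x - mean) / np.sqrt(var + eps)

batch = np.array([[1.0, 100.0],
                  [3.0, 300.0]])
out = batch_norm(batch)
# Each column now has mean ~0 and standard deviation ~1,
# regardless of the original scale of the feature.
```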
Bias (in neurons)
A learnable scalar added to the weighted sum of inputs before the activation function. Biases allow the activation function to be shifted left or right, giving the network more flexibility to fit data. Not to be confused with statistical bias.
Convolutional Neural Network (CNN)
A type of neural network especially effective for processing grid-like data such as images. Uses convolution operations — sliding a filter over the input — to detect local patterns like edges, textures, and shapes. The backbone of most computer vision systems.
Cross-Entropy Loss
A loss function for classification tasks. Measures the difference between the predicted probability distribution and the true one-hot distribution. The higher the probability the model assigns to the correct class, the lower the loss. Formula: -Σ y·log(ŷ).
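Computing the formula directly for two hypothetical predictions shows how confidence in the correct class lowers the loss:

```python
import numpy as np

def cross_entropy(y_true, y_pred):
    # -Σ y·log(ŷ); with a one-hot y_true, only the true class term survives
    return -np.sum(y_true * np.log(y_pred))

y_true    = np.array([0.0, 1.0, 0.0])   # true class is index 1
confident = np.array([0.1, 0.8, 0.1])
unsure    = np.array([0.4, 0.3, 0.3])
print(cross_entropy(y_true, confident))  # ≈ 0.223
print(cross_entropy(y_true, unsure))     # ≈ 1.204
```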
Deep Learning
A subfield of machine learning that uses neural networks with many layers ("deep" networks) to learn hierarchical representations of data. The "depth" allows the model to learn increasingly abstract features from raw input. Powers modern AI in vision, language, and beyond.
Dropout
A regularization technique where random neurons are set to zero with probability p during each training forward pass. This prevents the network from becoming too reliant on specific neurons, reducing overfitting. At test time, all neurons are used; either their outputs are scaled down by (1 − p), or — in the common "inverted dropout" variant — the surviving activations are scaled up by 1/(1 − p) during training instead, so test time needs no adjustment.
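A minimal sketch of the inverted-dropout variant, where the scaling happens during training:

```python
import numpy as np

def dropout(x, p=0.5, training=True):
    # Inverted dropout: zero each unit with probability p during training,
    # scaling survivors by 1/(1-p) so expected activations match test time.
    if not training:
        return x  # no-op at inference
    mask = (np.random.rand(*x.shape) >= p) / (1 - p)
    return x * mask
```

At test time `dropout(x, training=False)` simply returns `x` unchanged, which is what makes the training-time scaling convenient.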
Diffusion Model
A generative model that learns to reverse a process of gradually adding noise to data. During training, it learns to denoise. At generation time, it starts from pure noise and iteratively denoises to produce realistic images or other outputs. Powers DALL-E, Stable Diffusion, and Midjourney.
Embedding
A dense vector representation of something — a word, a user, an image — in a continuous, lower-dimensional space. Items that are semantically similar end up close together in this space. The basis of Word2Vec, recommendation systems, and transformer models.
Epoch
One complete pass through the entire training dataset. Neural networks are typically trained for many epochs. Too few and the model underfits; too many and it overfits. Monitoring validation loss across epochs guides when to stop training.
Fine-Tuning
Taking a pre-trained model (trained on a large dataset) and continuing to train it on a smaller, task-specific dataset. The model's existing knowledge is preserved while it adapts to the new task. Much more data-efficient than training from scratch.
Forward Pass
The process of passing input data through the network from input layer to output layer, computing activations at each step. Produces the model's prediction. The forward pass must happen before backpropagation can compute gradients.
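A toy forward pass through a two-layer network, with hand-picked (not learned) weights:

```python
import numpy as np

def forward(x, W1, b1, W2, b2):
    # Layer 1: linear transform followed by ReLU activation
    h = np.maximum(0, x @ W1 + b1)
    # Output layer: linear transform producing the raw scores (logits)
    return h @ W2 + b2

x  = np.array([1.0, -1.0])
W1 = np.array([[1.0, 0.0],
               [0.0, 1.0]])   # illustrative identity weights
b1 = np.zeros(2)
W2 = np.array([[1.0],
               [1.0]])
b2 = np.zeros(1)
print(forward(x, W1, b1, W2, b2))  # [1.]
```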
GAN (Generative Adversarial Network)
A framework with two competing networks: a Generator that creates fake data and a Discriminator that tries to tell real from fake. They train together in a game-like setup until the Generator produces realistic outputs. Pioneered photorealistic image synthesis.
Gradient Descent
An optimization algorithm that iteratively adjusts parameters in the direction that most reduces the loss, as indicated by the gradient (the direction of steepest increase in loss). Take small steps in the opposite direction of the gradient, and you move toward a minimum.
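The update rule is a one-liner; here it minimizes the toy function f(x) = (x − 3)², whose gradient is 2(x − 3):

```python
def grad_descent(lr=0.1, steps=100):
    # Minimize f(x) = (x - 3)^2 starting from x = 0
    x = 0.0
    for _ in range(steps):
        grad = 2 * (x - 3)   # gradient points toward increasing loss
        x -= lr * grad       # so we step in the opposite direction
    return x

print(grad_descent())  # converges toward 3.0, the minimum of f
```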
Hyperparameter
Configuration settings chosen before training begins that control the training process itself — not learned from data. Examples: learning rate, batch size, number of layers, dropout rate, optimizer type. Choosing good hyperparameters is a crucial part of model development.
Learning Rate
A hyperparameter that controls how large a step to take in the direction of the negative gradient during each update. Too high and training is unstable; too low and training is slow. Often the most impactful hyperparameter to tune.
LSTM (Long Short-Term Memory)
A type of recurrent neural network unit designed to capture long-range dependencies in sequences. Uses three gates — input, forget, output — to selectively remember or forget information over time. Largely superseded by transformers but still used in some applications.
Loss Function
A function that measures how wrong the model's predictions are. The training process minimizes this function. Common losses: Mean Squared Error (regression), Cross-Entropy (classification), Contrastive Loss (similarity). Choosing the right loss function is essential.
Model Inference
The process of using a trained model to make predictions on new data. Contrast with training (where the model learns). During inference, gradients are not computed and weights are not updated, making it faster and requiring less memory.
Neural Network
A computational model loosely inspired by the brain, consisting of layers of interconnected nodes (neurons). Each connection has a weight; each neuron applies an activation function. Given enough data and compute, neural networks can learn surprisingly complex functions.
Overfitting
When a model learns the training data too well — including its noise and random quirks — and fails to generalize to new unseen data. The model essentially memorizes rather than learns. Signs: low training loss, high validation loss. Solutions: regularization, more data, simpler model.
Optimizer
The algorithm that updates model weights based on computed gradients. SGD (Stochastic Gradient Descent) is the classic; Adam combines momentum and adaptive learning rates and is the most popular today. Choosing the right optimizer can significantly speed up training.
Perceptron
The simplest neural network unit — a single neuron. Takes a weighted sum of inputs, adds a bias, and passes through a step function. Introduced by Frank Rosenblatt in 1957. The theoretical foundation for all modern neural networks.
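The whole unit fits in a few lines; the weights below are chosen by hand to act as an AND gate, purely for illustration (a real perceptron would learn them):

```python
def perceptron(inputs, weights, bias):
    # Weighted sum plus bias, passed through a step function:
    # the neuron "fires" (outputs 1) only if the total is positive.
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if total > 0 else 0

AND = lambda a, b: perceptron([a, b], weights=[1, 1], bias=-1.5)
print(AND(1, 1))  # 1
print(AND(1, 0))  # 0
```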
Pooling
A downsampling operation in CNNs that reduces spatial dimensions while preserving important features. Max pooling takes the maximum value in each region; average pooling takes the mean. Reduces computation and adds some translation invariance.
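A sketch of 2×2 max pooling on a tiny single-channel "image" (values are arbitrary):

```python
import numpy as np

def max_pool_2x2(x):
    # Split an (H, W) array into non-overlapping 2x2 windows
    # and keep only the maximum value from each window.
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

img = np.array([[1, 2, 5, 6],
                [3, 4, 7, 8],
                [9, 1, 2, 3],
                [5, 6, 4, 0]])
print(max_pool_2x2(img))
# [[4 8]
#  [9 4]]
```

Note the 4×4 input becomes 2×2: the spatial dimensions halve while the strongest response in each region survives.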
ReLU (Rectified Linear Unit)
The most commonly used activation function. Defined as f(x) = max(0, x) — it passes positive values unchanged and zeros out negative values. Computationally cheap, avoids the vanishing gradient problem that plagued sigmoid/tanh. Variants: Leaky ReLU, PReLU, GELU.
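Plain ReLU and its Leaky variant side by side (input values are illustrative):

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # Like ReLU, but lets a small signal through for negative inputs,
    # so those neurons keep a nonzero gradient and can't fully "die".
    return np.where(x > 0, x, alpha * x)

x = np.array([-100.0, -1.0, 0.0, 2.0])
print(np.maximum(0, x))  # plain ReLU: negatives become exactly 0
print(leaky_relu(x))     # negatives are scaled by alpha instead
```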
Residual Network (ResNet)
A CNN architecture that uses skip connections — pathways that bypass one or more layers. A block's output is the sum of its learned transformation and its unchanged input. This allows training of very deep networks (50, 100+ layers) without gradient vanishing. A foundational architecture in computer vision.
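The skip connection itself is just an addition; here the "layer" is a stand-in lambda rather than a real convolution:

```python
import numpy as np

def residual_block(x, transform):
    # Skip connection: output = transform(x) + x, so even if the
    # learned transform contributes little, the identity path remains
    # and gradients can flow straight through it.
    return transform(x) + x

x = np.array([1.0, 2.0])
out = residual_block(x, lambda v: 0.1 * v)  # toy stand-in for a layer
print(out)  # ~[1.1 2.2]
```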
Softmax
An activation function applied to the output layer for multi-class classification. Converts a vector of raw scores into a probability distribution that sums to 1. The class with the highest probability is the prediction. Formula: σ(xᵢ) = e^xᵢ / Σe^xⱼ
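Applying the formula to an arbitrary score vector (subtracting the max first is a standard numerical-stability trick that leaves the result unchanged):

```python
import numpy as np

def softmax(x):
    # e^x_i / Σ e^x_j, stabilized by subtracting max(x) from every score
    e = np.exp(x - np.max(x))
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)
print(probs)        # ≈ [0.659 0.242 0.099]
print(probs.sum())  # 1.0 — a valid probability distribution
```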
Transformer
An architecture introduced in "Attention Is All You Need" (2017) that relies entirely on self-attention rather than recurrence. Transformers process entire sequences in parallel, enabling much more efficient training. The basis of GPT, BERT, T5, and virtually all modern NLP and many vision models.
Transfer Learning
Using knowledge gained from training on one task to improve performance on a different but related task. In practice: take a model trained on ImageNet or a large text corpus, then fine-tune it on your smaller dataset. Dramatically reduces the data and compute needed.
Vanishing Gradient Problem
A problem in deep networks where gradients become exponentially small as they're backpropagated through many layers, making early layers learn extremely slowly or not at all. Was a major barrier to training deep networks before ReLU activations, residual connections, and better initialization methods.
Weight
A learnable scalar parameter associated with a connection between neurons. Determines how much one neuron's output influences the next. Training a neural network is essentially finding the right values for all its weights (and biases) to minimize the loss function.