Deep Learning Glossary
200+ terms explained in plain English. No PhD required.
Activation Function
A mathematical function applied to a neuron's output to introduce non-linearity into the network. Without activation functions, stacking layers would be equivalent to a single linear transformation. Common examples: ReLU, Sigmoid, Tanh, GELU.
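A minimal sketch of two of these functions in NumPy (the input values are illustrative):

```python
import numpy as np

def relu(x):
    # Zeros out negatives, passes positives through unchanged
    return np.maximum(0, x)

def sigmoid(x):
    # Squashes any real value into the range (0, 1)
    return 1 / (1 + np.exp(-x))

x = np.array([-2.0, 0.0, 3.0])
print(relu(x))       # [0. 0. 3.]
print(sigmoid(0.0))  # 0.5
```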
Attention Mechanism
A way for models to dynamically focus on different parts of the input when producing each output. Instead of encoding everything into one vector, attention allows the model to "look back" at relevant parts. The foundation of transformers.
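The core computation (scaled dot-product attention, the variant used in transformers) can be sketched in a few lines; the matrices here are toy values, not learned parameters:

```python
import numpy as np

def softmax(x):
    # Row-wise softmax, stabilized by subtracting the row max
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    # Each query scores every key; the scores (after softmax)
    # weight a mix of the value vectors — the "look back" step.
    d = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d))
    return weights @ V

Q = np.array([[10.0, 0.0]])                  # one query
K = np.array([[10.0, 0.0], [0.0, 10.0]])     # two keys
V = np.array([[1.0, 0.0], [0.0, 1.0]])       # two values
print(attention(Q, K, V))  # the query attends almost entirely to the first key
```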
Autoencoder
A neural network trained to compress its input into a compact representation (encoding), then reconstruct the original from that representation (decoding). Used for dimensionality reduction, anomaly detection, and as a component of generative models.
Backpropagation
Short for "backward propagation of errors." The algorithm used to compute gradients of the loss with respect to each parameter in a neural network. It works by applying the chain rule of calculus backwards through the network's layers. This is what enables neural networks to learn from their mistakes.
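For a single neuron the chain rule can be traced by hand; the numbers below are arbitrary toy values:

```python
# Forward: prediction y = w*x + b, loss L = (y - t)**2
w, b, x, t = 2.0, 1.0, 3.0, 10.0
y = w * x + b          # 7.0
L = (y - t) ** 2       # 9.0

# Backward: apply the chain rule from the loss toward each parameter
dL_dy = 2 * (y - t)    # -6.0
dL_dw = dL_dy * x      # -18.0  (since dy/dw = x)
dL_db = dL_dy * 1      # -6.0   (since dy/db = 1)
```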
Batch Normalization
A technique that normalizes each layer's inputs to zero mean and unit variance over the current mini-batch, then applies a learnable scale and shift. This stabilizes and accelerates training, allowing higher learning rates. Typically applied after a linear transformation but before the activation function.
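The normalization step can be sketched like this (the learnable scale and shift a real layer would add are omitted for brevity; the batch values are illustrative):

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    # Normalize each feature (column) over the batch dimension (axis 0).
    # eps guards against division by zero for constant features.
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    return (x - mean) / np.sqrt(var + eps)

batch = np.array([[1.0, 100.0],
                  [3.0, 300.0]])
out = batch_norm(batch)
# Each column now has mean ~0 and standard deviation ~1,
# regardless of the original scale of the feature.
```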
Bias (in neurons)
A learnable scalar added to the weighted sum of inputs before the activation function. Biases allow the activation function to be shifted left or right, giving the network more flexibility to fit data. Not to be confused with statistical bias.
Convolutional Neural Network (CNN)
A type of neural network especially effective for processing grid-like data such as images. Uses convolution operations — sliding a filter over the input — to detect local patterns like edges, textures, and shapes. The backbone of most computer vision systems.
Cross-Entropy Loss
A loss function for classification tasks. Measures the difference between the predicted probability distribution and the true one-hot distribution. The higher the probability the model assigns to the correct class, the lower the loss. Formula: -Σ y·log(ŷ).
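Computing the formula directly for two hypothetical predictions shows how confidence in the correct class lowers the loss:

```python
import numpy as np

def cross_entropy(y_true, y_pred):
    # -Σ y·log(ŷ); with a one-hot y_true, only the true class term survives
    return -np.sum(y_true * np.log(y_pred))

y_true    = np.array([0.0, 1.0, 0.0])   # true class is index 1
confident = np.array([0.1, 0.8, 0.1])
unsure    = np.array([0.4, 0.3, 0.3])
print(cross_entropy(y_true, confident))  # ≈ 0.223
print(cross_entropy(y_true, unsure))     # ≈ 1.204
```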
Deep Learning
A subfield of machine learning that uses neural networks with many layers ("deep" networks) to learn hierarchical representations of data. The "depth" allows the model to learn increasingly abstract features from raw input. Powers modern AI in vision, language, and beyond.
Dropout
A regularization technique where random neurons are set to zero with probability p during each training forward pass. This prevents the network from becoming too reliant on specific neurons, reducing overfitting. At test time, all neurons are used; either their outputs are scaled down by (1 − p), or — in the common "inverted dropout" variant — the surviving activations are scaled up by 1/(1 − p) during training instead, so test time needs no adjustment.
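A minimal sketch of the inverted-dropout variant, where the scaling happens during training:

```python
import numpy as np

def dropout(x, p=0.5, training=True):
    # Inverted dropout: zero each unit with probability p during training,
    # scaling survivors by 1/(1-p) so expected activations match test time.
    if not training:
        return x  # no-op at inference
    mask = (np.random.rand(*x.shape) >= p) / (1 - p)
    return x * mask
```

At test time `dropout(x, training=False)` simply returns `x` unchanged, which is what makes the training-time scaling convenient.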
Diffusion Model
A generative model that learns to reverse a process of gradually adding noise to data. During training, it learns to denoise. At generation time, it starts from pure noise and iteratively denoises to produce realistic images or other outputs. Powers DALL-E, Stable Diffusion, and Midjourney.
Embedding
A dense vector representation of something — a word, a user, an image — in a continuous, lower-dimensional space. Items that are semantically similar end up close together in this space. The basis of Word2Vec, recommendation systems, and transformer models.
Epoch
One complete pass through the entire training dataset. Neural networks are typically trained for many epochs. Too few and the model underfits; too many and it overfits. Monitoring validation loss across epochs guides when to stop training.
Fine-Tuning
Taking a pre-trained model (trained on a large dataset) and continuing to train it on a smaller, task-specific dataset. The model's existing knowledge is preserved while it adapts to the new task. Much more data-efficient than training from scratch.
Forward Pass
The process of passing input data through the network from input layer to output layer, computing activations at each step. Produces the model's prediction. The forward pass must happen before backpropagation can compute gradients.
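A toy forward pass through a two-layer network, with hand-picked (not learned) weights:

```python
import numpy as np

def forward(x, W1, b1, W2, b2):
    # Layer 1: linear transform followed by ReLU activation
    h = np.maximum(0, x @ W1 + b1)
    # Output layer: linear transform producing the raw scores (logits)
    return h @ W2 + b2

x  = np.array([1.0, -1.0])
W1 = np.array([[1.0, 0.0],
               [0.0, 1.0]])   # illustrative identity weights
b1 = np.zeros(2)
W2 = np.array([[1.0],
               [1.0]])
b2 = np.zeros(1)
print(forward(x, W1, b1, W2, b2))  # [1.]
```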
GAN (Generative Adversarial Network)
A framework with two competing networks: a Generator that creates fake data and a Discriminator that tries to tell real from fake. They train together in a game-like setup until the Generator produces realistic outputs. Pioneered photorealistic image synthesis.
Gradient Descent
An optimization algorithm that iteratively adjusts parameters in the direction that most reduces the loss, as indicated by the gradient (the direction of steepest increase in loss). Take small steps in the opposite direction of the gradient, and you move toward a minimum.
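The update rule is a one-liner; here it minimizes the toy function f(x) = (x − 3)², whose gradient is 2(x − 3):

```python
def grad_descent(lr=0.1, steps=100):
    # Minimize f(x) = (x - 3)^2 starting from x = 0
    x = 0.0
    for _ in range(steps):
        grad = 2 * (x - 3)   # gradient points toward increasing loss
        x -= lr * grad       # so we step in the opposite direction
    return x

print(grad_descent())  # converges toward 3.0, the minimum of f
```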
Hyperparameter
Configuration settings chosen before training begins that control the training process itself — not learned from data. Examples: learning rate, batch size, number of layers, dropout rate, optimizer type. Choosing good hyperparameters is a crucial part of model development.
Learning Rate
A hyperparameter that controls how large a step to take in the direction of the negative gradient during each update. Too high and training is unstable; too low and training is slow. Often the most impactful hyperparameter to tune.
LSTM (Long Short-Term Memory)
A type of recurrent neural network unit designed to capture long-range dependencies in sequences. Uses three gates — input, forget, output — to selectively remember or forget information over time. Largely superseded by transformers but still used in some applications.
Loss Function
A function that measures how wrong the model's predictions are. The training process minimizes this function. Common losses: Mean Squared Error (regression), Cross-Entropy (classification), Contrastive Loss (similarity). Choosing the right loss function is essential.
Model Inference
The process of using a trained model to make predictions on new data. Contrast with training (where the model learns). During inference, gradients are not computed and weights are not updated, making it faster and requiring less memory.
Neural Network
A computational model loosely inspired by the brain, consisting of layers of interconnected nodes (neurons). Each connection has a weight; each neuron applies an activation function. Given enough data and compute, neural networks can learn surprisingly complex functions.
Overfitting
When a model learns the training data too well — including its noise and random quirks — and fails to generalize to new unseen data. The model essentially memorizes rather than learns. Signs: low training loss, high validation loss. Solutions: regularization, more data, simpler model.
Optimizer
The algorithm that updates model weights based on computed gradients. SGD (Stochastic Gradient Descent) is the classic; Adam combines momentum and adaptive learning rates and is the most popular today. Choosing the right optimizer can significantly speed up training.
Perceptron
The simplest neural network unit — a single neuron. Takes a weighted sum of inputs, adds a bias, and passes through a step function. Introduced by Frank Rosenblatt in 1957. The theoretical foundation for all modern neural networks.
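The whole unit fits in a few lines; the weights below are chosen by hand to act as an AND gate, purely for illustration (a real perceptron would learn them):

```python
def perceptron(inputs, weights, bias):
    # Weighted sum plus bias, passed through a step function:
    # the neuron "fires" (outputs 1) only if the total is positive.
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if total > 0 else 0

AND = lambda a, b: perceptron([a, b], weights=[1, 1], bias=-1.5)
print(AND(1, 1))  # 1
print(AND(1, 0))  # 0
```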
Pooling
A downsampling operation in CNNs that reduces spatial dimensions while preserving important features. Max pooling takes the maximum value in each region; average pooling takes the mean. Reduces computation and adds some translation invariance.
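A sketch of 2×2 max pooling on a tiny single-channel "image" (values are arbitrary):

```python
import numpy as np

def max_pool_2x2(x):
    # Split an (H, W) array into non-overlapping 2x2 windows
    # and keep only the maximum value from each window.
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

img = np.array([[1, 2, 5, 6],
                [3, 4, 7, 8],
                [9, 1, 2, 3],
                [5, 6, 4, 0]])
print(max_pool_2x2(img))
# [[4 8]
#  [9 4]]
```

Note the 4×4 input becomes 2×2: the spatial dimensions halve while the strongest response in each region survives.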
ReLU (Rectified Linear Unit)
The most commonly used activation function. Defined as f(x) = max(0, x) — it passes positive values unchanged and zeros out negative values. Computationally cheap, avoids the vanishing gradient problem that plagued sigmoid/tanh. Variants: Leaky ReLU, PReLU, GELU.
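Plain ReLU and its Leaky variant side by side (input values are illustrative):

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # Like ReLU, but lets a small signal through for negative inputs,
    # so those neurons keep a nonzero gradient and can't fully "die".
    return np.where(x > 0, x, alpha * x)

x = np.array([-100.0, -1.0, 0.0, 2.0])
print(np.maximum(0, x))  # plain ReLU: negatives become exactly 0
print(leaky_relu(x))     # negatives are scaled by alpha instead
```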
Residual Network (ResNet)
A CNN architecture that uses skip connections — pathways that bypass one or more layers. A block's output is the sum of its learned transformation and its unchanged input. This allows training of very deep networks (50, 100+ layers) without gradient vanishing. A foundational architecture in computer vision.
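The skip connection itself is just an addition; here the "layer" is a stand-in lambda rather than a real convolution:

```python
import numpy as np

def residual_block(x, transform):
    # Skip connection: output = transform(x) + x, so even if the
    # learned transform contributes little, the identity path remains
    # and gradients can flow straight through it.
    return transform(x) + x

x = np.array([1.0, 2.0])
out = residual_block(x, lambda v: 0.1 * v)  # toy stand-in for a layer
print(out)  # ~[1.1 2.2]
```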
Softmax
An activation function applied to the output layer for multi-class classification. Converts a vector of raw scores into a probability distribution that sums to 1. The class with the highest probability is the prediction. Formula: σ(xᵢ) = e^xᵢ / Σe^xⱼ
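Applying the formula to an arbitrary score vector (subtracting the max first is a standard numerical-stability trick that leaves the result unchanged):

```python
import numpy as np

def softmax(x):
    # e^x_i / Σ e^x_j, stabilized by subtracting max(x) from every score
    e = np.exp(x - np.max(x))
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)
print(probs)        # ≈ [0.659 0.242 0.099]
print(probs.sum())  # 1.0 — a valid probability distribution
```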
Transformer
An architecture introduced in "Attention Is All You Need" (2017) that relies entirely on self-attention rather than recurrence. Transformers process entire sequences in parallel, enabling much more efficient training. The basis of GPT, BERT, T5, and virtually all modern NLP and many vision models.
Transfer Learning
Using knowledge gained from training on one task to improve performance on a different but related task. In practice: take a model trained on ImageNet or a large text corpus, then fine-tune it on your smaller dataset. Dramatically reduces the data and compute needed.
Vanishing Gradient Problem
A problem in deep networks where gradients become exponentially small as they're backpropagated through many layers, making early layers learn extremely slowly or not at all. Was a major barrier to training deep networks before ReLU activations, residual connections, and better initialization methods.
Weight
A learnable scalar parameter associated with a connection between neurons. Determines how much one neuron's output influences the next. Training a neural network is essentially finding the right values for all its weights (and biases) to minimize the loss function.