Written by Sebastian F. Genter
L
L₀ regularization
Sparsity-inducing regularization that penalizes the count of non-zero weights. Encourages feature selection by eliminating less important parameters, often used in model compression. Computationally challenging due to non-convex nature.
L₁ loss
Robust regression loss measuring absolute differences (Mean Absolute Error). Formula: MAE = (1/n) Σᵢ |yᵢ − ŷᵢ|. Less sensitive to outliers than L₂ loss, creates piecewise linear optimization landscape.
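A minimal pure-Python comparison (toy numbers, all invented for illustration) of how the two losses react to a single outlier:

```python
# Toy comparison of L1 (MAE) and L2 (MSE) loss on data with one outlier.

def mae(y_true, y_pred):
    """Mean Absolute Error: average of |y - y_hat|."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mse(y_true, y_pred):
    """Mean Squared Error: average of (y - y_hat)^2."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [1.0, 2.0, 3.0, 100.0]   # last point is an outlier
y_pred = [1.0, 2.0, 3.0, 4.0]

print(mae(y_true, y_pred))  # 24.0   -> outlier contributes linearly
print(mse(y_true, y_pred))  # 2304.0 -> outlier dominates quadratically
```

The outlier's residual of 96 enters the L₁ loss once but is squared (9216) under L₂, which is why MAE is the more robust choice for noisy targets.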
L₁ regularization
(Lasso) Adds penalty proportional to absolute weight values: λ Σᵢ |wᵢ|. Produces sparse models by driving unimportant weights to exactly zero, enabling automatic feature selection.
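A sketch of why L₁ yields exact zeros: the proximal operator of the L₁ penalty is soft thresholding, which clamps small weights to 0. The weight values below are arbitrary illustrations:

```python
# Soft-thresholding: the proximal operator of the L1 penalty.
# Weights whose magnitude falls below lam are set to exactly zero,
# which is how Lasso produces sparse models.

def soft_threshold(w, lam):
    """Shrink w toward zero by lam; clamp to 0 inside [-lam, lam]."""
    if w > lam:
        return w - lam
    if w < -lam:
        return w + lam
    return 0.0

weights = [2.0, -0.25, 1.5, -0.1, 3.0]
sparse = [soft_threshold(w, lam=0.5) for w in weights]
print(sparse)  # [1.5, 0.0, 1.0, 0.0, 2.5]
```

Contrast with L₂ regularization, which shrinks every weight multiplicatively but never lands exactly on zero.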
L₂ loss
(Squared Error) Quadratic loss measuring squared deviations: Σᵢ (yᵢ − ŷᵢ)². Strongly convex optimization landscape, sensitive to outliers. Basis for ordinary least squares regression.
L₂ regularization
(Ridge) Penalizes squared magnitude of weights: λ Σᵢ wᵢ². Shrinks coefficients without eliminating features, improves conditioning of ill-posed problems. Equivalent to Gaussian prior in Bayesian interpretation.
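The shrink-without-eliminate behavior is easiest to see in the one-feature, no-intercept case, where ridge regression has the scalar closed form w = Σxy / (Σx² + λ). Toy data below:

```python
# One-feature ridge regression (no intercept): the closed form
# w = sum(x*y) / (sum(x^2) + lam) shows how lambda shrinks the
# coefficient toward zero without ever making it exactly zero.

def ridge_weight(xs, ys, lam):
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]                   # true slope is 2
print(ridge_weight(xs, ys, lam=0.0))   # 2.0 (ordinary least squares)
print(ridge_weight(xs, ys, lam=14.0))  # 1.0 (heavily shrunk, still nonzero)
```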
label
Target variable being predicted in supervised learning. For image classification: object categories; for regression: continuous values. Labels form the ground truth for training and evaluation.
labeled example
Data point containing both input features and corresponding target value. Essential for supervised learning algorithms. Acquisition costs vary by domain - medical labels require expert annotation.
label leakage
When features unintentionally contain target information. Example: Including "patient cured" flag in features when predicting recovery time. Causes inflated performance metrics and failed real-world deployment.
lambda
Symbol (λ) representing regularization strength hyperparameter. Controls tradeoff between fitting training data and model simplicity. Higher λ increases regularization effect.
LaMDA
(Language Model for Dialogue Applications) Google's conversational AI focused on natural dialogue flow. Employs transformer architecture with modifications for improved coherence and factual grounding.
landmarks
Annotated reference points in visual data. Used in:
- Facial recognition (eye corners, nose tip)
- Medical imaging (anatomical markers)
- AR applications
Enables pose estimation and alignment.
language model
Statistical model learning probability distributions over token sequences. Predicts next tokens given context. Evolution: N-gram → RNN → Transformer models. Basis for modern NLP systems.
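The N-gram stage of that evolution fits in a few lines: a bigram model estimates P(next | current) from raw counts. The corpus below is a made-up toy sentence:

```python
# Minimal bigram language model: estimate P(next | current) from counts.
from collections import Counter, defaultdict

corpus = "the cat sat on the mat the cat ran".split()

counts = defaultdict(Counter)
for cur, nxt in zip(corpus, corpus[1:]):
    counts[cur][nxt] += 1

def next_prob(cur, nxt):
    """P(nxt | cur) by maximum likelihood (no smoothing)."""
    total = sum(counts[cur].values())
    return counts[cur][nxt] / total if total else 0.0

print(next_prob("the", "cat"))  # 2/3: "the" is followed by "cat" twice, "mat" once
```

Modern transformer language models predict the same conditional distribution, just with learned contextual representations instead of raw counts.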
large language model
Transformer-based models with billions of parameters trained on web-scale text. Exhibit emergent abilities like reasoning and in-context learning. Examples: GPT-4, PaLM, Claude. Require distributed training infrastructure.
latent space
Lower-dimensional representation capturing data essentials. Learned through autoencoders or GANs. Enables operations like vector arithmetic on semantic concepts (king - man + woman = queen).
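The king − man + woman example can be sketched with hand-crafted toy vectors; real latent vectors are learned, not designed, so the dimensions below (gender, royalty, …) are purely illustrative:

```python
# Toy illustration of vector arithmetic in an embedding space.
# These 3-d vectors are hand-crafted for illustration only.
import math

emb = {
    "king":  [1.0, 1.0, 0.0],   # [male, royal, extra]
    "queen": [0.0, 1.0, 0.0],
    "man":   [1.0, 0.0, 0.0],
    "woman": [0.0, 0.0, 0.0],
}

def add(a, b):  return [x + y for x, y in zip(a, b)]
def sub(a, b):  return [x - y for x, y in zip(a, b)]

# king - man + woman, then find the nearest word in the space
target = add(sub(emb["king"], emb["man"]), emb["woman"])
nearest = min(emb, key=lambda w: math.dist(emb[w], target))
print(nearest)  # queen
```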
layer
Neural network component transforming inputs through parameterized operations. Types:
- Dense (linear + activation)
- Convolutional (spatial filters)
- Attention (contextual weighting)
- Normalization (distribution scaling)
Layers API (tf.layers)
Legacy TensorFlow interface for layer construction. Superseded by Keras layers but still found in older codebases. Provides basic layers like Dense, Conv2D with manual weight management.
leaf
Terminal node in decision trees producing final predictions. Contains class distribution (classification) or constant value (regression). Depth to leaves indicates model complexity.
Learning Interpretability Tool (LIT)
Interactive platform for model analysis. Features:
- Attribution visualization
- Counterfactual generation
- Dataset slicing
- Performance metrics
Supports text, image, and tabular models.
learning rate
Most critical hyperparameter controlling step size in gradient descent. Too high causes divergence; too low slows convergence. Adaptive methods (Adam) automate per-parameter adjustments.
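The divergence/convergence tradeoff shows up even on a one-dimensional quadratic. A sketch with invented step sizes:

```python
# Gradient descent on f(w) = (w - 3)^2, whose gradient is 2*(w - 3).
# lr = 0.1 converges toward the minimum at w = 3; lr = 1.1 overshoots
# and diverges, with the error growing each step.

def descend(lr, steps=20, w=0.0):
    for _ in range(steps):
        w -= lr * 2 * (w - 3)
    return w

print(descend(lr=0.1))  # close to 3 (converges)
print(descend(lr=1.1))  # huge magnitude (diverges)
```

For this quadratic the error is multiplied by |1 − 2·lr| per step, so any lr above 1.0 diverges; adaptive optimizers like Adam sidestep this tuning by rescaling each parameter's step individually.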
least squares regression
Linear modeling technique minimizing sum of squared residuals. Closed-form solution: ŵ = (XᵀX)⁻¹Xᵀy. Assumptions: linearity, homoscedasticity, independence, normality. Foundation for econometrics.
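In the single-feature case the normal equations reduce to the textbook formulas b = cov(x, y)/var(x) and a = mean(y) − b·mean(x). A sketch on an exactly linear toy dataset:

```python
# Ordinary least squares for y = a + b*x, scalar case of the
# normal-equations solution (X^T X)^-1 X^T y.

def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b          # (intercept, slope)

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]          # exactly y = 1 + 2x
print(fit_line(xs, ys))            # (1.0, 2.0)
```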
Levenshtein Distance
Edit distance measuring minimum single-character operations (insert, delete, substitute) to transform strings. Used in spell check, DNA alignment, OCR correction.
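The distance is computed with the classic dynamic-programming recurrence, here in a memory-efficient two-row form:

```python
# Classic dynamic-programming Levenshtein distance.
# prev[j] holds the edit distance between a[:i-1] and b[:j].

def levenshtein(a, b):
    prev = list(range(len(b) + 1))        # distance from "" to b[:j]
    for i, ca in enumerate(a, 1):
        cur = [i]                         # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,              # delete ca
                cur[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb), # substitute (free if equal)
            ))
        prev = cur
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
```

The "kitten" → "sitting" example needs one substitution (k→s), one substitution (e→i), and one insertion (g), matching the computed distance of 3.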
linear
Mathematical relationship expressible through addition and scalar multiplication. Linear models have additive, non-interacting features. Contrast with nonlinear relationships requiring polynomial terms or neural networks.
linear model
Predictive function of form ŷ = w₁x₁ + w₂x₂ + … + wₙxₙ + b. Limited to modeling additive effects but highly interpretable. Basis for many statistical methods: regression, SVMs with linear kernels.
linear regression
Continuous outcome modeling using linear predictor. Evaluates feature importance through coefficients. Diagnostic metrics: R², residual plots, p-values. Foundation for econometrics.
LIT
→ See Learning Interpretability Tool (LIT)
LLM
→ See large language model
LLM evaluations (evals)
Assessment protocols for large language models covering:
- Factual accuracy (TruthfulQA)
- Reasoning (GSM8K)
- Toxicity (RealToxicityPrompts)
- Instruction following (AlpacaEval)
Require combination of automated and human assessment.
logistic regression
Probabilistic classification using sigmoid function. Decision boundary linear in feature space. Optimizes cross-entropy loss via gradient descent. Foundation for neural network classifiers.
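A single-feature sketch (toy, linearly separable data invented for illustration): the gradient of cross-entropy loss with respect to (w, b) is simply (p − y)·(x, 1), which keeps the update rule short:

```python
# Logistic regression on one feature, trained by stochastic gradient
# descent on cross-entropy loss. Toy linearly separable data.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0, 0, 1, 1]

w, b, lr = 0.0, 0.0, 0.5
for _ in range(200):
    for x, y in zip(xs, ys):
        p = sigmoid(w * x + b)
        # gradient of cross-entropy wrt (w, b) is (p - y) * (x, 1)
        w -= lr * (p - y) * x
        b -= lr * (p - y)

preds = [sigmoid(w * x + b) > 0.5 for x in xs]
print(preds)  # [False, False, True, True]
```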
logits
Unnormalized output values before applying final activation (softmax/sigmoid). Represent relative confidence scores. Used in loss calculations to avoid saturation issues.
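The saturation issue is why frameworks apply softmax in a numerically stable form, subtracting the maximum logit before exponentiating (which leaves the result unchanged):

```python
# Numerically stable softmax: subtracting max(logits) before exp
# avoids overflow without changing the resulting probabilities.
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([1000.0, 1001.0, 1002.0])  # naive exp(1000.0) would overflow
print(probs)
print(sum(probs))  # 1.0 up to float rounding
```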
Log Loss
(Binary Cross-Entropy) Penalizes incorrect confidence estimates. Formula: −[y log(ŷ) + (1 − y) log(1 − ŷ)]. Strongly convex, provides smooth gradient flow for probabilistic models.
log-odds
Logarithm of odds ratio: log(p / (1 − p)). Linear in logistic regression. Inverse of sigmoid function. Enables interpretation of coefficients as multiplicative odds changes.
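The inverse relationship with the sigmoid is easy to verify numerically:

```python
# log-odds (logit) and sigmoid are inverses: sigmoid(logit(p)) == p.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    return math.log(p / (1.0 - p))

print(logit(0.5))           # 0.0 -- even odds map to log-odds of zero
print(sigmoid(logit(0.9)))  # 0.9 (round trip recovers the probability)
```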
Long Short-Term Memory (LSTM)
RNN variant with gated memory cells. Components:
- Forget gate (discards irrelevant info)
- Input gate (updates cell state)
- Output gate (controls exposure)
Handles long-range dependencies better than vanilla RNNs.
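The gate equations above can be sketched for a single scalar cell; the weights here are arbitrary made-up numbers, not trained values:

```python
# One LSTM cell step with scalar state, showing the gating structure.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, W):
    """W maps gate name -> [input weight, recurrent weight, bias]."""
    f = sigmoid(W["f"][0] * x + W["f"][1] * h_prev + W["f"][2])    # forget gate
    i = sigmoid(W["i"][0] * x + W["i"][1] * h_prev + W["i"][2])    # input gate
    o = sigmoid(W["o"][0] * x + W["o"][1] * h_prev + W["o"][2])    # output gate
    g = math.tanh(W["g"][0] * x + W["g"][1] * h_prev + W["g"][2])  # candidate
    c = f * c_prev + i * g        # new cell state: keep some old, add some new
    h = o * math.tanh(c)          # new hidden state: gated exposure of the cell
    return h, c

W = {k: [0.5, 0.1, 0.0] for k in "fiog"}  # arbitrary illustrative weights
h, c = 0.0, 0.0
for x in [1.0, -1.0, 0.5]:
    h, c = lstm_step(x, h, c, W)
print(h, c)
```

Because the cell state c is updated additively (f·c + i·g) rather than repeatedly squashed, gradients survive over many more timesteps than in a vanilla RNN.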
LoRA
(Low-Rank Adaptation) Parameter-efficient fine-tuning using rank-decomposed weight updates. Freezes pretrained weights, injects trainable low-rank matrices. Reduces memory usage by >90% versus full fine-tuning.
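A pure-Python sketch of the core idea: the effective weight is the frozen matrix W plus a low-rank update (α/r)·B·A. The tiny shapes and values below are illustrative only; the memory savings grow with the full dimension d:

```python
# LoRA sketch: frozen weight matrix W plus a trainable low-rank
# update B @ A scaled by alpha / r. Shapes and values are toy examples.

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

d, r = 4, 1                       # full dimension 4, rank-1 update
W = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]  # frozen
B = [[1.0], [0.0], [0.0], [0.0]]  # d x r, trainable
A = [[0.0, 2.0, 0.0, 0.0]]        # r x d, trainable
alpha = 1.0

delta = matmul(B, A)              # rank-1 matrix, d x d
W_eff = [[w + (alpha / r) * dv for w, dv in zip(wr, dr)]
         for wr, dr in zip(W, delta)]

full_params = d * d               # parameters if we fine-tuned W directly
lora_params = d * r + r * d       # only B and A are trained
print(W_eff[0])                   # [1.0, 2.0, 0.0, 0.0]
print(lora_params, "<", full_params)
```

At realistic scale (d in the thousands, r around 4-64) the trainable-parameter ratio 2dr/d² becomes tiny, which is where the >90% memory reduction comes from.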
loss
Quantitative measure of prediction error. Training minimizes loss through parameter updates. Common types:
- Task-specific (cross-entropy)
- Regularization (L1/L2)
- Composite (triplet loss)
loss aggregator
Combination method for multi-task learning losses:
- Weighted sum
- Dynamic weighting (uncertainty)
- GradNorm
Balances competing objectives during optimization.
loss curve
Training trajectory visualization showing loss vs iterations. Patterns indicate:
- Learning rate suitability
- Overfitting/underfitting
- Convergence
Critical for debugging model behavior.
loss function
Objective function quantifying model performance. Design considerations:
- Differentiability
- Robustness to outliers
- Scale sensitivity
- Task alignment (e.g., IoU for detection)
loss surface
High-dimensional error landscape shaped by model parameters. Optimization navigates this terrain seeking minima. Visualization techniques: PCA slices, random projections.
Low-Rank Adaptation (LoRA)
→ See LoRA
LSTM
→ See Long Short-Term Memory