Written by Sebastian F. Genter
B
backpropagation
The fundamental algorithm for training neural networks through gradient calculation. Comprises two phases:
- Forward pass: Compute predictions and loss
- Backward pass: Calculate gradients using chain rule from calculus
Automatically adjusts weights across all network layers. Modern frameworks handle this automatically, unlike early manual implementations.
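A minimal NumPy sketch of both phases on a toy one-hidden-layer regression network; the shapes, sigmoid activation, and learning rate are illustrative choices, not tied to any particular framework:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))    # 8 examples, 3 features
y = rng.normal(size=(8, 1))    # regression targets
W1, W2 = rng.normal(size=(3, 4)), rng.normal(size=(4, 1))

for step in range(100):
    # Forward pass: compute predictions and loss
    h = 1 / (1 + np.exp(-X @ W1))        # hidden layer with sigmoid
    pred = h @ W2
    loss = ((pred - y) ** 2).mean()

    # Backward pass: apply the chain rule layer by layer
    d_pred = 2 * (pred - y) / len(y)     # dLoss/dPred for mean squared error
    d_W2 = h.T @ d_pred
    d_h = d_pred @ W2.T
    d_W1 = X.T @ (d_h * h * (1 - h))     # sigmoid derivative is h * (1 - h)

    W1 -= 0.1 * d_W1                     # gradient descent weight updates
    W2 -= 0.1 * d_W2
```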
bagging
(Bootstrap Aggregating) Ensemble technique creating multiple models from random data subsets sampled with replacement. Key characteristics:
- Reduces variance through model diversity
- Aggregates predictions via voting (classification) or averaging (regression)
- Core component of random forests (ensembles of decision trees)
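A hedged sketch of bagging by hand with scikit-learn decision trees: each model trains on a bootstrap sample, and predictions are combined by majority vote; the toy data and ensemble size are illustrative:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

models = []
for _ in range(25):
    # Sample rows with replacement (the "bootstrap" in Bootstrap Aggregating)
    idx = rng.integers(0, len(X), size=len(X))
    models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

# Aggregate by majority vote (use averaging instead for regression)
votes = np.stack([m.predict(X) for m in models])
ensemble_pred = (votes.mean(axis=0) > 0.5).astype(int)
```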
bag of words
Text representation method ignoring word order while preserving frequency. Examples:
- "the dog jumps" = [1,1,1,0,...]
- "jumps the dog" = [1,1,1,0,...] (identical representation)
Used in early NLP systems with sparse vector encodings. Evolved into TF-IDF and n-gram variants.
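A minimal sketch with a tiny hypothetical vocabulary, reproducing the identical-representation behavior shown above:

```python
vocab = {"the": 0, "dog": 1, "jumps": 2, "cat": 3}  # illustrative vocabulary

def bag_of_words(text: str) -> list[int]:
    # Count word frequencies; position in the sentence is discarded
    vec = [0] * len(vocab)
    for word in text.lower().split():
        if word in vocab:
            vec[vocab[word]] += 1
    return vec

assert bag_of_words("the dog jumps") == bag_of_words("jumps the dog")
```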
baseline
Reference model establishing minimum performance expectations. Common baselines:
- Majority class classifier for imbalanced data
- Linear regression vs deep networks
- Previous system version in production
Helps quantify improvements from new approaches.
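A minimal sketch of a majority-class baseline; the label list is hypothetical:

```python
from collections import Counter

labels = ["spam", "ham", "ham", "ham", "spam", "ham"]

# Always predict the most frequent class
majority = Counter(labels).most_common(1)[0][0]        # -> "ham"

# Any real model must beat this accuracy to add value
baseline_acc = labels.count(majority) / len(labels)    # -> 0.667
```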
batch
Group of examples processed together during model training/inference. Benefits:
- Hardware-friendly matrix operations
- Smoother gradient estimates
- Memory efficiency vs single-example processing
batch inference
Parallel prediction method dividing inputs into subgroups for simultaneous processing. Particularly effective on:
- TPU/GPU accelerators
- Large-scale prediction tasks
Contrasts with real-time online inference requiring immediate responses.
batch normalization
Normalization technique applied to layer inputs/outputs. Key benefits:
- Enables higher learning rates
- Reduces internal covariate shift
- Acts as mild regularizer
Implemented by adjusting mean/variance per batch during training.
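A minimal NumPy sketch of the training-time computation; at inference, frameworks typically substitute running averages for the per-batch statistics:

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    # Normalize each feature using this batch's mean and variance,
    # then rescale with learned parameters gamma (scale) and beta (shift)
    mean, var = x.mean(axis=0), x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

x = np.random.default_rng(0).normal(loc=5.0, scale=3.0, size=(32, 4))
out = batch_norm(x, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0).round(6), out.std(axis=0).round(3))  # ~0 mean, ~1 std
```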
batch size
Critical hyperparameter setting the number of examples per gradient update:
- 1 (stochastic gradient descent): high-variance updates
- Full dataset (batch gradient descent): computationally expensive
- 32-512 (common mini-batch sizes): balance of efficiency and stability
Affects memory usage and convergence speed.
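A minimal sketch of mini-batch iteration over a shuffled dataset; the array shapes and batch size of 64 are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
batch_size = 64                          # common mini-batch range: 32-512

perm = rng.permutation(len(X))           # reshuffle once per epoch
for start in range(0, len(X), batch_size):
    batch = X[perm[start:start + batch_size]]
    # ...compute the loss and perform one gradient update on `batch`...
```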
Bayesian neural network
Probabilistic networks capturing uncertainty through:
- Weight distributions vs fixed values
- Predictive confidence intervals
Particularly valuable in medical diagnosis and risk-sensitive applications.
Bayesian optimization
Smart hyperparameter search strategy using:
- Surrogate model (e.g., a Gaussian process)
- Acquisition function (e.g., Upper Confidence Bound)
Efficiently explores parameter space with few evaluations.
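A hedged sketch using scikit-learn's Gaussian process as the surrogate and a confidence-bound acquisition rule to pick each next point; the 1-D objective function is hypothetical:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def objective(x):                        # hypothetical expensive black box
    return (x - 0.3) ** 2 + 0.1 * np.sin(20 * x)

candidates = np.linspace(0, 1, 200).reshape(-1, 1)
X_seen = np.array([[0.0], [0.5], [1.0]])            # a few initial evaluations
y_seen = objective(X_seen).ravel()

for _ in range(10):
    # Surrogate model: cheap probabilistic stand-in for the objective
    gp = GaussianProcessRegressor(alpha=1e-6).fit(X_seen, y_seen)
    mu, sigma = gp.predict(candidates, return_std=True)
    # Acquisition: prefer low predicted mean (minimizing) and high uncertainty
    score = -mu + 2.0 * sigma
    x_next = candidates[np.argmax(score)]
    X_seen = np.vstack([X_seen, x_next.reshape(1, 1)])
    y_seen = np.append(y_seen, objective(x_next))

print(X_seen[np.argmin(y_seen)])         # best input found in 13 evaluations
```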
Bellman equation
Foundational RL equation defining optimal value functions, here in Q-function form: Q*(s, a) = E[r + γ · max_a' Q*(s', a')]. Forms the basis for Q-learning updates in temporal difference learning.
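A minimal sketch of the tabular Q-learning update derived from the equation; the table size, learning rate alpha, and discount gamma are illustrative:

```python
import numpy as np

n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.99

def q_update(s, a, r, s_next):
    # Move Q(s, a) toward the Bellman target r + gamma * max_a' Q(s', a')
    target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (target - Q[s, a])

q_update(s=0, a=1, r=1.0, s_next=2)     # one temporal-difference step
```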
BERT
(Bidirectional Encoder Representations from Transformers) Breakthrough NLP model featuring:
- Transformer encoder architecture
- Masked language modeling pretraining
- Context-aware word embeddings
Revolutionized transfer learning for text tasks.
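A hedged usage sketch of masked language modeling via the third-party Hugging Face transformers library (downloads the bert-base-uncased weights on first run):

```python
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# BERT scores candidates for [MASK] using context on BOTH sides of the gap
for guess in fill_mask("The [MASK] jumped over the fence."):
    print(guess["token_str"], round(guess["score"], 3))
```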
bias (ethics/fairness)
Systematic errors in ML systems categorized as:
- Cognitive biases (confirmation, in-group)
- Measurement biases (sampling, reporting)
Critical consideration in responsible AI development.
bias (math)
Model's baseline output when all features are zero. In the linear equation y' = b + w1x1 + w2x2 + … + wnxn:
- b represents the bias (intercept) term
- Lets the model fit relationships that do not pass through the origin
bidirectional
Processing context from both directions in sequences. Example:
- Unidirectional: predicts "____" in "The ____ jumped" using only the left context
- Bidirectional: also uses "jumped" from the right context for a better prediction
bidirectional language model
Contextual language models using full sentence context. Handles challenges like:
- Pronoun resolution ("He" refers to antecedent)
- Polysemy ("bank" as financial vs river)
bigram
Pair of consecutive tokens. Fundamental unit in:
- Language modeling ("New York")
- Text generation
- Basic spelling correction
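A minimal sketch extracting bigrams from a token list:

```python
tokens = "new york is a big city".split()
bigrams = list(zip(tokens, tokens[1:]))   # pairs of consecutive tokens
# [('new', 'york'), ('york', 'is'), ('is', 'a'), ('a', 'big'), ('big', 'city')]
```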
binary classification
Two-class prediction tasks with metrics:
- Precision/Recall
- ROC AUC
- F1 Score
Common applications: Spam detection, medical diagnosis.
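A minimal sketch computing three of these metrics from raw confusion counts; the tp/fp/fn values are illustrative, e.g. from a spam detector's test set:

```python
tp, fp, fn = 80, 10, 20   # true positives, false positives, false negatives

precision = tp / (tp + fp)    # 0.889: of messages flagged, how many were spam
recall = tp / (tp + fn)       # 0.800: of actual spam, how much was caught
f1 = 2 * precision * recall / (precision + recall)   # 0.842: harmonic mean
```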
binary condition
Decision tree splits producing two paths. Example:
IF temperature ≥ 100°C:
    THEN "High risk"
ELSE:
    Proceed to next condition
binning
Converting continuous features to categorical ranges. Example:
- Age → [0-12, 13-19, 20-64, 65+]
Trade-off: Gains nonlinear handling at cost of increased dimensionality.
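A minimal NumPy sketch matching the age example, where the bin edges encode the range boundaries:

```python
import numpy as np

ages = np.array([5, 15, 30, 70])
edges = [13, 20, 65]                        # boundaries between the four ranges
labels = ["0-12", "13-19", "20-64", "65+"]

bins = np.digitize(ages, edges)             # -> array([0, 1, 2, 3])
print([labels[b] for b in bins])            # ['0-12', '13-19', '20-64', '65+']
```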
BLEU
(Bilingual Evaluation Understudy) Translation quality metric:
- Measures n-gram overlap (1-4 grams)
- Penalizes short translations
- Scale 0-1 (1=perfect match)
Limited in handling semantic equivalence.
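A hedged sketch using NLTK's implementation (a third-party library); sentence-level BLEU is noisy, and corpus-level BLEU is the standard in practice:

```python
from nltk.translate.bleu_score import sentence_bleu

reference = [["the", "cat", "sat", "on", "the", "mat"]]  # list of references
candidate = ["the", "cat", "sat", "on", "a", "mat"]

# Default weights average 1-gram through 4-gram overlap precision
score = sentence_bleu(reference, candidate)
print(round(score, 3))
```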
BLEURT
(Bilingual Evaluation Understudy with Representations from Transformers) Advanced translation assessment:
- Uses BERT embeddings
- Understands paraphrases
- Correlates better with human judgments
Requires pretraining on human ratings.
boosting
Ensemble method converting weak learners to strong predictors via:
- Sequential error correction
- Example reweighting
Famous implementations: AdaBoost, Gradient Boosted Trees
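A hedged scikit-learn sketch: AdaBoost fits weak learners (depth-1 decision "stumps" by default) sequentially, reweighting the data each round to emphasize previously misclassified examples; the synthetic dataset is illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=500, random_state=0)

# 100 sequential weak learners, each correcting its predecessors' errors
clf = AdaBoostClassifier(n_estimators=100).fit(X, y)
print(clf.score(X, y))
```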
bounding box
Rectangular image coordinates specifying object location. Format:
- Top-left (x1,y1)
- Bottom-right (x2,y2)
Critical for object detection evaluation using IoU (Intersection over Union).
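A minimal sketch of the IoU computation for two boxes in the corner format above:

```python
def iou(box_a, box_b):
    # Boxes as (x1, y1, x2, y2): top-left and bottom-right corners
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Overlap rectangle; max(0, ...) handles non-intersecting boxes
    inter = max(0, min(ax2, bx2) - max(ax1, bx1)) * \
            max(0, min(ay2, by2) - max(ay1, by1))
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union else 0.0

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))   # 25 / 175 ≈ 0.143
```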
broadcasting
Array operation technique in NumPy/TensorFlow:
- Automatically aligns dimensions
- Expands smaller arrays
Example: Adding scalar to matrix without explicit replication.
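A minimal NumPy sketch of both cases, a scalar and a row vector broadcast against a matrix:

```python
import numpy as np

matrix = np.ones((3, 4))
row = np.array([1.0, 2.0, 3.0, 4.0])   # shape (4,)

# NumPy "stretches" the smaller operand to shape (3, 4) without copying data
print(matrix + 10)     # scalar added to every element
print(matrix * row)    # row applied to each of the 3 matrix rows
```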
bucketing
(See binning) Alternative term for converting continuous values to discrete ranges through:
- Fixed intervals
- Quantile-based divisions
- Domain knowledge thresholds