Written by Sebastian F. Genter
B
backpropagation
The fundamental algorithm for training neural networks through gradient calculation. Comprises two phases:
- Forward pass: Compute predictions and loss
- Backward pass: Calculate gradients using chain rule from calculus
Automatically adjusts weights across all network layers. Modern frameworks handle this automatically, unlike early manual implementations.
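A minimal NumPy sketch of both phases on a toy one-hidden-layer regression network; the shapes, sigmoid activation, and learning rate are illustrative choices, not tied to any particular framework:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))    # 8 examples, 3 features
y = rng.normal(size=(8, 1))    # regression targets
W1, W2 = rng.normal(size=(3, 4)), rng.normal(size=(4, 1))

for step in range(100):
    # Forward pass: compute predictions and loss
    h = 1 / (1 + np.exp(-X @ W1))        # hidden layer with sigmoid
    pred = h @ W2
    loss = ((pred - y) ** 2).mean()

    # Backward pass: apply the chain rule layer by layer
    d_pred = 2 * (pred - y) / len(y)     # dLoss/dPred for mean squared error
    d_W2 = h.T @ d_pred
    d_h = d_pred @ W2.T
    d_W1 = X.T @ (d_h * h * (1 - h))     # sigmoid derivative is h * (1 - h)

    W1 -= 0.1 * d_W1                     # gradient descent weight updates
    W2 -= 0.1 * d_W2
```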
bagging
(Bootstrap Aggregating) Ensemble technique creating multiple models from random data subsets sampled with replacement. Key characteristics:
- Reduces variance through model diversity
- Aggregates predictions via voting (classification) or averaging (regression)
- Core component of random forests (ensembles of decision trees)
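A hedged sketch of bagging by hand with scikit-learn decision trees: each model trains on a bootstrap sample, and predictions are combined by majority vote; the toy data and ensemble size are illustrative:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

models = []
for _ in range(25):
    # Sample rows with replacement (the "bootstrap" in Bootstrap Aggregating)
    idx = rng.integers(0, len(X), size=len(X))
    models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

# Aggregate by majority vote (use averaging instead for regression)
votes = np.stack([m.predict(X) for m in models])
ensemble_pred = (votes.mean(axis=0) > 0.5).astype(int)
```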
bag of words
Text representation method ignoring word order while preserving frequency. Examples:
- "the dog jumps" = [1,1,1,0,...]
- "jumps the dog" = [1,1,1,0,...] (identical representation)
Used in early NLP systems with sparse vector encodings. Evolved into TF-IDF and n-gram variants.
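A minimal sketch with a tiny hypothetical vocabulary, reproducing the identical-representation behavior shown above:

```python
vocab = {"the": 0, "dog": 1, "jumps": 2, "cat": 3}  # illustrative vocabulary

def bag_of_words(text: str) -> list[int]:
    # Count word frequencies; position in the sentence is discarded
    vec = [0] * len(vocab)
    for word in text.lower().split():
        if word in vocab:
            vec[vocab[word]] += 1
    return vec

assert bag_of_words("the dog jumps") == bag_of_words("jumps the dog")
```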
baseline
Reference model establishing minimum performance expectations. Common baselines:
- Majority class classifier for imbalanced data
- Linear regression vs deep networks
- Previous system version in production
Helps quantify improvements from new approaches.
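A minimal sketch of a majority-class baseline; the label list is hypothetical:

```python
from collections import Counter

labels = ["spam", "ham", "ham", "ham", "spam", "ham"]

# Always predict the most frequent class
majority = Counter(labels).most_common(1)[0][0]        # -> "ham"

# Any real model must beat this accuracy to add value
baseline_acc = labels.count(majority) / len(labels)    # -> 0.667
```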
batch
Group of examples processed together during model training/inference. Benefits:
- Hardware-friendly matrix operations
- Smoother gradient estimates
- Memory efficiency vs single-example processing
batch inference
Parallel prediction method dividing inputs into subgroups for simultaneous processing. Particularly effective on:
- TPU/GPU accelerators
- Large-scale prediction tasks
Contrasts with real-time online inference requiring immediate responses.
batch normalization
Normalization technique applied to layer inputs/outputs. Key benefits:
- Enables higher learning rates
- Reduces internal covariate shift
- Acts as mild regularizer
Implemented by adjusting mean/variance per batch during training.
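A minimal NumPy sketch of the training-time computation; at inference, frameworks typically substitute running averages for the per-batch statistics:

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    # Normalize each feature using this batch's mean and variance,
    # then rescale with learned parameters gamma (scale) and beta (shift)
    mean, var = x.mean(axis=0), x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

x = np.random.default_rng(0).normal(loc=5.0, scale=3.0, size=(32, 4))
out = batch_norm(x, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0).round(6), out.std(axis=0).round(3))  # ~0 mean, ~1 std
```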
batch size
Critical hyperparameter setting the number of examples per gradient update:
- 1 (stochastic gradient descent): high-variance updates
- Full dataset (batch gradient descent): computationally expensive
- 32-512 (common mini-batch sizes): balance of efficiency and stability
Affects memory usage and convergence speed.
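A minimal sketch of mini-batch iteration over a shuffled dataset; the array shapes and batch size of 64 are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
batch_size = 64                          # common mini-batch range: 32-512

perm = rng.permutation(len(X))           # reshuffle once per epoch
for start in range(0, len(X), batch_size):
    batch = X[perm[start:start + batch_size]]
    # ...compute the loss and perform one gradient update on `batch`...
```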
Bayesian neural network
Probabilistic networks capturing uncertainty through:
- Weight distributions vs fixed values
- Predictive confidence intervals
Particularly valuable in medical diagnosis and risk-sensitive applications.
Bayesian optimization
Smart hyperparameter search strategy using:
- Surrogate model (e.g., a Gaussian process)
- Acquisition function (e.g., Upper Confidence Bound)
Efficiently explores parameter space with few evaluations.
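A hedged sketch using scikit-learn's Gaussian process as the surrogate and a confidence-bound acquisition rule to pick each next point; the 1-D objective function is hypothetical:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def objective(x):                        # hypothetical expensive black box
    return (x - 0.3) ** 2 + 0.1 * np.sin(20 * x)

candidates = np.linspace(0, 1, 200).reshape(-1, 1)
X_seen = np.array([[0.0], [0.5], [1.0]])            # a few initial evaluations
y_seen = objective(X_seen).ravel()

for _ in range(10):
    # Surrogate model: cheap probabilistic stand-in for the objective
    gp = GaussianProcessRegressor(alpha=1e-6).fit(X_seen, y_seen)
    mu, sigma = gp.predict(candidates, return_std=True)
    # Acquisition: prefer low predicted mean (minimizing) and high uncertainty
    score = -mu + 2.0 * sigma
    x_next = candidates[np.argmax(score)]
    X_seen = np.vstack([X_seen, x_next.reshape(1, 1)])
    y_seen = np.append(y_seen, objective(x_next))

print(X_seen[np.argmin(y_seen)])         # best input found in 13 evaluations
```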
Bellman equation
Foundational RL equation defining optimal value functions, here in Q-function form: Q*(s, a) = E[r + γ · max_a' Q*(s', a')]. Forms the basis for Q-learning updates in temporal difference learning.
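A minimal sketch of the tabular Q-learning update derived from the equation; the table size, learning rate alpha, and discount gamma are illustrative:

```python
import numpy as np

n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.99

def q_update(s, a, r, s_next):
    # Move Q(s, a) toward the Bellman target r + gamma * max_a' Q(s', a')
    target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (target - Q[s, a])

q_update(s=0, a=1, r=1.0, s_next=2)     # one temporal-difference step
```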
BERT
(Bidirectional Encoder Representations from Transformers) Breakthrough NLP model featuring:
- Transformer encoder architecture
- Masked language modeling pretraining
- Context-aware word embeddings
Revolutionized transfer learning for text tasks.
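A hedged usage sketch of masked language modeling via the third-party Hugging Face transformers library (downloads the bert-base-uncased weights on first run):

```python
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# BERT scores candidates for [MASK] using context on BOTH sides of the gap
for guess in fill_mask("The [MASK] jumped over the fence."):
    print(guess["token_str"], round(guess["score"], 3))
```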
bias (ethics/fairness)
Systematic errors in ML systems categorized as:
- Cognitive biases (confirmation, in-group)
- Measurement biases (sampling, reporting)
Critical consideration in responsible AI development.
bias (math)
Model's baseline output when all features are zero. In the linear equation y' = b + w1x1 + w2x2 + … + wnxn:
- b represents the bias (intercept) term
- Lets the model fit relationships that do not pass through the origin
bidirectional
Processing context from both directions in sequences. Example:
- Unidirectional: predicts "____" in "The ____ jumped" using only the left context
- Bidirectional: also uses "jumped" from the right context for a better prediction
bidirectional language model
Contextual language models using full sentence context. Handles challenges like:
- Pronoun resolution ("He" refers to antecedent)
- Polysemy ("bank" as financial vs river)
bigram
Pair of consecutive tokens. Fundamental unit in:
- Language modeling ("New York")
- Text generation
- Basic spelling correction
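A minimal sketch extracting bigrams from a token list:

```python
tokens = "new york is a big city".split()
bigrams = list(zip(tokens, tokens[1:]))   # pairs of consecutive tokens
# [('new', 'york'), ('york', 'is'), ('is', 'a'), ('a', 'big'), ('big', 'city')]
```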
binary classification
Two-class prediction tasks with metrics:
- Precision/Recall
- ROC AUC
- F1 Score
Common applications: Spam detection, medical diagnosis.
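A minimal sketch computing three of these metrics from raw confusion counts; the tp/fp/fn values are illustrative, e.g. from a spam detector's test set:

```python
tp, fp, fn = 80, 10, 20   # true positives, false positives, false negatives

precision = tp / (tp + fp)    # 0.889: of messages flagged, how many were spam
recall = tp / (tp + fn)       # 0.800: of actual spam, how much was caught
f1 = 2 * precision * recall / (precision + recall)   # 0.842: harmonic mean
```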
binary condition
Decision tree splits producing two paths. Example:
IF temperature ≥ 100°C:
    THEN "High risk"
ELSE:
    Proceed to next condition
binning
Converting continuous features to categorical ranges. Example:
- Age → [0-12, 13-19, 20-64, 65+]
Trade-off: Gains nonlinear handling at cost of increased dimensionality.
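A minimal NumPy sketch matching the age example, where the bin edges encode the range boundaries:

```python
import numpy as np

ages = np.array([5, 15, 30, 70])
edges = [13, 20, 65]                        # boundaries between the four ranges
labels = ["0-12", "13-19", "20-64", "65+"]

bins = np.digitize(ages, edges)             # -> array([0, 1, 2, 3])
print([labels[b] for b in bins])            # ['0-12', '13-19', '20-64', '65+']
```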
BLEU
(Bilingual Evaluation Understudy) Translation quality metric:
- Measures n-gram overlap (1-4 grams)
- Penalizes short translations
- Scale 0-1 (1=perfect match)
Limited in handling semantic equivalence.
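A hedged sketch using NLTK's implementation (a third-party library); sentence-level BLEU is noisy, and corpus-level BLEU is the standard in practice:

```python
from nltk.translate.bleu_score import sentence_bleu

reference = [["the", "cat", "sat", "on", "the", "mat"]]  # list of references
candidate = ["the", "cat", "sat", "on", "a", "mat"]

# Default weights average 1-gram through 4-gram overlap precision
score = sentence_bleu(reference, candidate)
print(round(score, 3))
```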
BLEURT
(Bilingual Evaluation Understudy with Representations from Transformers) Advanced translation assessment:
- Uses BERT embeddings
- Understands paraphrases
- Correlates better with human judgments
Requires pretraining on human ratings.
boosting
Ensemble method converting weak learners to strong predictors via:
- Sequential error correction
- Example reweighting
Famous implementations: AdaBoost, Gradient Boosted Trees
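A hedged scikit-learn sketch: AdaBoost fits weak learners (depth-1 decision "stumps" by default) sequentially, reweighting the data each round to emphasize previously misclassified examples; the synthetic dataset is illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=500, random_state=0)

# 100 sequential weak learners, each correcting its predecessors' errors
clf = AdaBoostClassifier(n_estimators=100).fit(X, y)
print(clf.score(X, y))
```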
bounding box
Rectangular image coordinates specifying object location. Format:
- Top-left (x1,y1)
- Bottom-right (x2,y2)
Critical for object detection evaluation using IoU (Intersection over Union).
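A minimal sketch of the IoU computation for two boxes in the corner format above:

```python
def iou(box_a, box_b):
    # Boxes as (x1, y1, x2, y2): top-left and bottom-right corners
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Overlap rectangle; max(0, ...) handles non-intersecting boxes
    inter = max(0, min(ax2, bx2) - max(ax1, bx1)) * \
            max(0, min(ay2, by2) - max(ay1, by1))
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union else 0.0

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))   # 25 / 175 ≈ 0.143
```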
broadcasting
Array operation technique in NumPy/TensorFlow:
- Automatically aligns dimensions
- Expands smaller arrays
Example: Adding scalar to matrix without explicit replication.
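A minimal NumPy sketch of both cases, a scalar and a row vector broadcast against a matrix:

```python
import numpy as np

matrix = np.ones((3, 4))
row = np.array([1.0, 2.0, 3.0, 4.0])   # shape (4,)

# NumPy "stretches" the smaller operand to shape (3, 4) without copying data
print(matrix + 10)     # scalar added to every element
print(matrix * row)    # row applied to each of the 3 matrix rows
```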
bucketing
(See binning) Alternative term for converting continuous values to discrete ranges through:
- Fixed intervals
- Quantile-based divisions
- Domain knowledge thresholds