Written by Sebastian F. Genter
M
Machine Learning
A branch of artificial intelligence focused on developing systems that learn patterns from data without explicit programming. These systems improve automatically through experience, enabling tasks like image recognition, natural language processing, and predictive analytics. Common approaches include supervised, unsupervised, and reinforcement learning.
Machine Translation
The automated process of converting text or speech from one language to another using computational methods. Modern approaches often leverage neural networks trained on bilingual corpora to capture contextual nuances, improving fluency compared to rule-based systems.
Majority Class
In classification tasks with imbalanced datasets, the class that occurs most frequently. For example, in a spam detection dataset where 95% of emails are "not spam," the "not spam" label represents the majority class. Models may struggle to learn minority classes without techniques like resampling.
Markov Decision Process (MDP)
A mathematical framework for modeling decision-making in environments where outcomes are partially random and partially under the control of an agent. Key components include states, actions, transition probabilities, and rewards. Used extensively in reinforcement learning to optimize policies.
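The components above can be sketched with value iteration on a tiny, made-up MDP. The transition probabilities, rewards, and discount factor below are hypothetical values chosen only to illustrate the Bellman optimality update:

```python
import numpy as np

# Toy 2-state, 2-action MDP (hypothetical numbers for illustration).
# P[s, a, s'] = transition probability; R[s, a] = expected immediate reward.
P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],   # transitions from state 0
    [[0.1, 0.9], [0.8, 0.2]],   # transitions from state 1
])
R = np.array([[0.0, 1.0],
              [2.0, 0.0]])
gamma = 0.9                      # discount factor

V = np.zeros(2)
for _ in range(500):
    # Bellman optimality update: V(s) = max_a [R(s,a) + gamma * sum_s' P(s,a,s') V(s')]
    V = np.max(R + gamma * (P @ V), axis=1)

# The optimal policy picks the action achieving the maximum in each state.
policy = np.argmax(R + gamma * (P @ V), axis=1)
print(V, policy)
```

Here the policy moves from state 0 toward state 1, where the larger reward can be collected repeatedly.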
Markov Property
A property stating that the future state of a system depends only on the current state, not on the sequence of events preceding it. This simplification is critical for tractable solutions in MDPs and probabilistic models like Hidden Markov Models.
Masked Language Model
A type of language model trained to predict missing or masked tokens in a sequence. For example, given the input "The [MASK] sat on the mat," the model might predict "cat." BERT popularized this approach for pre-training contextual embeddings.
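Preparing a masked-LM training example can be sketched as below. The masking rate is set higher than BERT's usual 15% so that this six-token toy sentence actually receives a mask; the sentence and rate are illustrative choices, not fixed by any library:

```python
import random

random.seed(0)

tokens = "the cat sat on the mat".split()
MASK = "[MASK]"

def mask_tokens(tokens, rate=0.3):
    """Randomly replace tokens with [MASK]; labels keep the original
    token at masked positions and None elsewhere (positions not scored)."""
    inputs, labels = [], []
    for tok in tokens:
        if random.random() < rate:
            inputs.append(MASK)
            labels.append(tok)      # the model must recover this token
        else:
            inputs.append(tok)
            labels.append(None)
    return inputs, labels

inputs, labels = mask_tokens(tokens)
print(inputs)
print(labels)
```

The model is then trained to predict each label from the masked input sequence.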
Matplotlib
A Python library for creating static, interactive, and animated visualizations. Widely used in machine learning for exploratory data analysis, plotting training curves, and visualizing model outputs like confusion matrices.
Matrix Factorization
A technique to decompose a matrix into lower-dimensional matrices to uncover latent patterns. In recommendation systems, it predicts user-item interactions by factorizing a user-item matrix into user and item embedding matrices, enabling personalized recommendations.
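A minimal sketch of this idea, fitting a toy user-item matrix with gradient descent on the observed entries. The ratings, latent dimension, learning rate, and regularization strength are hypothetical values for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy user-item rating matrix; 0 marks an unobserved entry.
R = np.array([[5, 3, 0],
              [4, 0, 1],
              [0, 1, 5]], dtype=float)
mask = R > 0
k = 2                                         # latent dimension

U = 0.1 * rng.standard_normal((R.shape[0], k))   # user embeddings
V = 0.1 * rng.standard_normal((R.shape[1], k))   # item embeddings
lr, reg = 0.05, 0.01

for _ in range(2000):
    E = (R - U @ V.T) * mask          # error on observed entries only
    U += lr * (E @ V - reg * U)       # gradient step on user factors
    V += lr * (E.T @ U - reg * V)     # gradient step on item factors

pred = U @ V.T                        # fills in the unobserved entries
print(np.round(pred, 1))
```

The product `U @ V.T` reconstructs the observed ratings and predicts the missing ones, which is what a recommender ranks.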
Mean Absolute Error (MAE)
A regression metric calculating the average absolute difference between predicted and true values. Formula: $\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert$. Less sensitive to outliers than Mean Squared Error.
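The definition translates directly into code; a small sketch with made-up numbers:

```python
import numpy as np

def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return np.mean(np.abs(y_true - y_pred))

# |3 - 2.5| + |5 - 5| + |2 - 4| = 2.5, averaged over 3 values:
print(mean_absolute_error([3.0, 5.0, 2.0], [2.5, 5.0, 4.0]))  # → 0.8333...
```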
Mean Average Precision at k (mAP@k)
A ranking metric averaging precision@k across multiple queries. For recommender systems, it measures how often relevant items appear in the top-k results. Higher values indicate better ranking quality.
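A common way to compute this is to average precision over the relevant positions within the top-k for each query, then take the mean across queries; the two toy queries below are hypothetical:

```python
def average_precision_at_k(recommended, relevant, k):
    """AP@k for one query: precision@i summed over ranks i where the item
    is relevant, normalized by min(k, number of relevant items)."""
    relevant = set(relevant)
    hits, score = 0, 0.0
    for i, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            hits += 1
            score += hits / i          # precision at this cut-off
    return score / min(k, len(relevant)) if relevant else 0.0

def map_at_k(all_recommended, all_relevant, k):
    """Mean of AP@k over all queries."""
    aps = [average_precision_at_k(recs, rel, k)
           for recs, rel in zip(all_recommended, all_relevant)]
    return sum(aps) / len(aps)

print(map_at_k([["a", "b", "c"], ["x", "y", "z"]],
               [["a", "c"], ["z"]], k=3))
```

Relevant items ranked earlier contribute more, so the metric rewards good ordering, not just inclusion in the top k.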
Mean Squared Error (MSE)
A regression loss function computing the average squared difference between predictions and true values. Formula: $\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$. Penalizes large errors more heavily than MAE.
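A sketch with made-up numbers, showing how the squared term lets one large error dominate:

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    """Average squared difference between targets and predictions."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return np.mean((y_true - y_pred) ** 2)

# Errors of 1 and 3 give (1 + 9) / 2; the error of 3 contributes 9x as much.
print(mean_squared_error([0.0, 0.0], [1.0, 3.0]))  # → 5.0
```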
Mesh
In distributed computing, a logical or physical arrangement of devices (e.g., TPUs) used to parallelize model training or inference. Defines how tensors are partitioned across accelerators for efficient computation.
Meta-Learning
Algorithms designed to "learn how to learn," enabling models to adapt quickly to new tasks with minimal data. Examples include model-agnostic meta-learning (MAML) and optimization-based approaches for few-shot learning.
Metric
A quantitative measure of model performance. Common examples include accuracy, F1 score, and ROC-AUC. Distinct from loss functions, metrics are often used for evaluation rather than optimization.
Metrics API (tf.metrics)
A TensorFlow module providing pre-implemented evaluation metrics (e.g., tf.metrics.accuracy). Simplifies tracking model performance during training and validation.
Mini-Batch
A subset of training data used in a single iteration of gradient descent. Balances computational efficiency (smaller batches) with stable convergence (larger batches). Typical sizes range from 32 to 1024 examples.
Mini-Batch Stochastic Gradient Descent
A gradient descent variant that updates model parameters using gradients computed on mini-batches. Combines the efficiency of stochastic gradient descent with the stability of batch gradient descent.
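The loop structure can be sketched on a toy linear-regression problem; the synthetic data, learning rate, and batch size are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = 2x + 1 plus a little noise (hypothetical example).
X = rng.uniform(-1, 1, size=(200, 1))
y = 2 * X[:, 0] + 1 + 0.01 * rng.standard_normal(200)

w, b = 0.0, 0.0
lr, batch_size = 0.1, 32

for epoch in range(100):
    idx = rng.permutation(len(X))            # reshuffle each epoch
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        xb, yb = X[batch, 0], y[batch]
        err = (w * xb + b) - yb              # residuals on the mini-batch
        w -= lr * np.mean(err * xb)          # MSE gradient w.r.t. w
        b -= lr * np.mean(err)               # MSE gradient w.r.t. b

print(round(w, 2), round(b, 2))  # ≈ 2.0 and 1.0
```

Each parameter update uses only one mini-batch's gradient, which is the defining difference from full-batch gradient descent.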
Minimax Loss
A loss function used in generative adversarial networks (GANs) where the generator and discriminator are trained adversarially. The discriminator is trained to maximize its ability to distinguish real from fake data, while the generator is trained to minimize it.
Minority Class
The less frequent class in imbalanced datasets. For example, in fraud detection, fraudulent transactions might represent <1% of data. Techniques like SMOTE or weighted loss functions address underrepresentation.
Mixture of Experts (MOE)
A neural architecture where specialized submodels ("experts") handle different input patterns. A gating network routes inputs to relevant experts, improving model capacity without proportional computational cost.
ML
Abbreviation for Machine Learning.
MMIT
Abbreviation for Multimodal Instruction-Tuned: models fine-tuned to follow instructions across multiple data types (text, images, audio). Enhances versatility in tasks like visual question answering.
MNIST
A benchmark dataset of 70,000 handwritten digits (0-9). Each image is 28x28 pixels with corresponding labels. Commonly used for testing image classification algorithms.
Modality
A data type or format, such as text, image, audio, or video. Multimodal models integrate multiple modalities (e.g., CLIP for text-image alignment).
Model
A mathematical structure that maps inputs to outputs. Examples include decision trees (rule-based), neural networks (layered transformations), and regression models (linear relationships).
Model Capacity
The complexity a model can represent, often correlated with the number of parameters. High-capacity models (e.g., deep networks) may overfit without regularization.
Model Cascading
A deployment strategy using multiple models of varying complexity. Simple queries are handled by lightweight models, while complex ones trigger larger models, optimizing latency and resource usage.
Model Parallelism
Distributing a model across multiple devices (e.g., GPUs) to handle large architectures. Layers or tensors are split horizontally or vertically to fit memory constraints.
Model Router
A component in cascading systems that directs input to the most suitable model based on complexity thresholds or confidence scores.
Model Training
The process of adjusting model parameters to minimize prediction error. Involves forward passes, loss computation, backpropagation, and iterative optimization.
MOE
Abbreviation for Mixture of Experts.
Momentum
An optimization technique that accelerates gradient descent by incorporating a moving average of past gradients. Reduces oscillations and escapes shallow local minima.
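The update rule can be sketched on an ill-conditioned quadratic; the objective and hyperparameters below are toy values for illustration:

```python
import numpy as np

def grad(w):
    # Gradient of f(w) = 0.5 * (w0^2 + 10 * w1^2), an ill-conditioned
    # toy objective where plain gradient descent oscillates.
    return np.array([1.0, 10.0]) * w

w = np.array([5.0, 5.0])
v = np.zeros(2)                  # velocity: decaying sum of past gradients
lr, beta = 0.05, 0.9             # learning rate and momentum coefficient

for _ in range(200):
    v = beta * v + grad(w)       # accumulate momentum
    w = w - lr * v               # step along the smoothed direction

print(w)  # ≈ [0, 0]
```

The velocity term dampens oscillation along the steep axis while accelerating progress along the shallow one.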
MT
Abbreviation for Machine Translation.
Multi-Class Classification
Classification tasks with more than two possible labels. For example, categorizing images into "dog," "cat," or "bird." Output layers often use softmax activation.
Multi-Class Logistic Regression
An extension of logistic regression for multi-class problems. Uses softmax to produce probability distributions over multiple classes. Also called multinomial regression.
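The softmax step can be sketched as follows; the weights, bias, and input below are made-up values, not a trained model:

```python
import numpy as np

def softmax(z):
    z = z - np.max(z, axis=-1, keepdims=True)   # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Two features, three classes: scores are a linear function of the input.
W = np.array([[1.0, -1.0, 0.0],
              [0.5,  0.5, -1.0]])
b = np.array([0.0, 0.0, 0.1])
x = np.array([2.0, 1.0])

probs = softmax(x @ W + b)
print(probs, probs.sum())   # a probability distribution over 3 classes
```

Training would then minimize cross-entropy between these probabilities and the true class labels.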
Multi-Head Self-Attention
A Transformer mechanism where self-attention is applied multiple times in parallel. Each "head" learns different attention patterns, capturing diverse contextual relationships in sequences.
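A minimal NumPy sketch of the mechanism, omitting masking, biases, and dropout; the dimensions and random weights are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_self_attention(X, Wq, Wk, Wv, Wo, num_heads):
    """Scaled dot-product self-attention run in parallel over heads."""
    T, d = X.shape
    dh = d // num_heads                          # per-head dimension
    Q, K, V = X @ Wq, X @ Wk, X @ Wv             # (T, d) each
    # Split the model dimension into heads: (heads, T, dh).
    split = lambda M: M.reshape(T, num_heads, dh).transpose(1, 0, 2)
    Q, K, V = split(Q), split(K), split(V)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(dh)   # (heads, T, T)
    out = softmax(scores) @ V                         # (heads, T, dh)
    out = out.transpose(1, 0, 2).reshape(T, d)        # concatenate heads
    return out @ Wo                                   # output projection

d, T, heads = 8, 4, 2
X = rng.standard_normal((T, d))
Wq, Wk, Wv, Wo = (rng.standard_normal((d, d)) for _ in range(4))
Y = multi_head_self_attention(X, Wq, Wk, Wv, Wo, heads)
print(Y.shape)  # one output vector per input position
```

Each head attends over the full sequence with its own projections, and the concatenated head outputs are mixed by the final projection.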
Multimodal Instruction-Tuned
Models trained to follow instructions involving multiple data types (e.g., "Describe this image in French"). Combines cross-modal understanding with task-specific fine-tuning.
Multimodal Model
A model processing and/or generating multiple data types. Examples include DALL-E (text-to-image) and Whisper (speech-to-text). Enhances capabilities in cross-modal tasks.
Multinomial Classification
Synonym for multi-class classification. Predicts one label from three or more possible classes using techniques like one-vs-rest or softmax regression.
Multinomial Regression
Another term for multi-class logistic regression. Models the probability of each class using a linear combination of input features, normalized via softmax.
Multitask
Training a single model on multiple related tasks. Shares representations across tasks, improving data efficiency and generalization. Example: joint training for named entity recognition and part-of-speech tagging.