Written by Sebastian F. Genter
S
sampling bias
- responsible
Sampling bias is a type of bias that occurs when the process of collecting data results in a dataset that is not representative of the true distribution of the phenomenon or population being studied. This can happen if certain members of the population are more likely to be included in the sample than others. For example, a dataset collected only from users of a specific type of smartphone would exhibit sampling bias if the goal is to build a model that generalizes to all smartphone users. Sampling bias is a major concern because it can lead to models that perform poorly or unfairly when applied to the broader population.
sampling with replacement
- fundamentals
Sampling with replacement is a method of sampling from a dataset where, after an item is selected and included in the sample, it is returned to the original dataset and can be selected again. This means that the same item can appear multiple times in a single sample. This method is a core component of bagging algorithms, such as Random Forests, where bootstrap samples are created by sampling with replacement from the training dataset.
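A minimal sketch of drawing one bootstrap sample with Python's standard library (the toy dataset here is hypothetical):

```python
import random

# Hypothetical dataset of six training examples.
dataset = ["a", "b", "c", "d", "e", "f"]

# Sampling with replacement: every draw is made from the full dataset,
# so the same element may appear more than once in the sample.
bootstrap_sample = [random.choice(dataset) for _ in range(len(dataset))]
print(bootstrap_sample)  # e.g. ['c', 'a', 'c', 'f', 'b', 'f']
```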
SavedModel
- TensorFlow
The recommended format for saving and loading TensorFlow models. A SavedModel contains a complete TensorFlow program, including the model's parameters (the weights and biases) and the computation necessary to perform inference. SavedModels can be loaded and run on different platforms (like servers, mobile devices, or edge devices) and with different APIs (Python, C++, Java, etc.), making them highly portable for serving trained models in production.
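A minimal sketch of exporting and reloading a SavedModel, assuming TensorFlow 2.x; the toy model and path are hypothetical:

```python
import tensorflow as tf

# Hypothetical toy model; any trained model can be exported the same way.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),
    tf.keras.layers.Dense(1),
])

# Export the complete program (weights plus computation) as a SavedModel.
tf.saved_model.save(model, "/tmp/my_saved_model")

# Later, possibly in another process or on another platform, reload it.
loaded = tf.saved_model.load("/tmp/my_saved_model")
```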
Saver
- TensorFlow
In older versions of TensorFlow (specifically TensorFlow 1.x), the Saver was a class used to save and restore model variables to and from checkpoint files. It allowed users to save the progress of training by periodically saving the values of the model's variables, enabling the training process to be interrupted and resumed later. In TensorFlow 2.x and later, the recommended way to save and load models is using the SavedModel format and Keras APIs, which have largely superseded the direct use of the `Saver` class.
scalar
- TensorFlow
In mathematics and machine learning, a scalar is a single numerical value, as opposed to a vector, matrix, or Tensor with more than one dimension. A scalar has a rank of 0.
scaling
- fundamentals
Scaling is a data preprocessing technique used to transform the values of numerical features to a standardized range. This is often done to ensure that no single feature dominates the learning process due to its larger magnitude. Common scaling methods include Z-score normalization (standardization), which scales values to have a mean of 0 and a standard deviation of 1, and Min-Max scaling, which scales values to a fixed range, typically between 0 and 1. Scaling is an important step for many machine learning algorithms, particularly those sensitive to the scale of input features, such as Support Vector Machines and gradient descent-based algorithms.
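Both methods can be sketched in a few lines of NumPy (the feature values are hypothetical):

```python
import numpy as np

values = np.array([2.0, 4.0, 6.0, 8.0, 10.0])  # hypothetical feature column

# Z-score normalization: mean 0, standard deviation 1.
z_scaled = (values - values.mean()) / values.std()

# Min-Max scaling: maps the values into the range [0, 1].
min_max_scaled = (values - values.min()) / (values.max() - values.min())
```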
scikit-learn
A popular open-source Python library for machine learning. Scikit-learn provides a wide range of efficient tools for classification, regression, clustering, dimensionality reduction, model selection, and preprocessing. It is known for its consistent API and ease of use, making it a widely adopted library for both beginners and experienced practitioners in the field. Consult the scikit-learn user guide for more information.
scoring
In machine learning, scoring typically refers to the process of using a trained model to generate predictions on new, unseen data. This is synonymous with inference. The term scoring is also sometimes used more broadly to refer to the process of evaluating a model's performance using various metrics.
selection bias
- responsible
Selection bias is a type of bias that arises from the way in which data is selected for a dataset. It occurs when the process for including or excluding data results in a dataset that is not representative of the population or phenomenon the model is intended to generalize to. Types of selection bias include coverage bias, sampling bias, and participation bias (non-response bias). Selection bias is a major concern for fairness and can lead to models that perform poorly or unfairly on certain subgroups or in real-world scenarios.
self-attention (also called self-attention layer)
- attention
Self-attention is an attention mechanism that allows a model to weigh the importance of different parts of the same input sequence when processing a particular element in that sequence. In a Transformer model, the self-attention layer enables each token in a sequence to attend to all other tokens in the same sequence, learning contextual relationships and dependencies regardless of the distance between tokens. This is achieved by computing attention scores based on the relationships between query, key, and value vectors derived from the input sequence.
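A minimal single-head sketch in NumPy (random projection matrices stand in for learned weights; no masking or multi-head logic):

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention over one sequence."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v             # query/key/value projections
    scores = q @ k.T / np.sqrt(k.shape[-1])         # pairwise attention scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over each row
    return weights @ v                              # weighted sum of value vectors

# Hypothetical sequence of 3 tokens with 4-dimensional embeddings.
rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4))
w_q, w_k, w_v = (rng.normal(size=(4, 4)) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)  # shape (3, 4): one context vector per token
```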
self-supervised learning
A type of machine learning where the model learns to predict parts of its input from other parts of its input, without requiring explicitly provided human-labeled labels. In self-supervised learning, the data itself provides the supervision signal. For example, a language model can be trained to predict the next word in a sentence given the preceding words, where the "label" is the actual next word. This approach leverages large amounts of readily available unlabeled data to pre-train powerful models that can then be fine-tuned on smaller labeled datasets for specific downstream tasks.
self-training
A semi-supervised machine learning technique where a model is initially trained on a small amount of labeled data. Then, the model is used to make predictions on unlabeled data. High-confidence predictions on the unlabeled data are treated as if they were true labels, and these "pseudo-labeled" examples are added to the training set. The model is then retrained on the augmented dataset. This iterative process allows the model to potentially learn from a larger pool of data than initially available with labels, though it can also amplify errors if the pseudo-labels are incorrect.
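One possible pseudo-labeling loop, sketched with scikit-learn; the classifier choice and the 0.95 confidence threshold are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def self_train(x_labeled, y_labeled, x_unlabeled, threshold=0.95, rounds=5):
    """Iteratively promote high-confidence predictions to training labels."""
    model = LogisticRegression()
    for _ in range(rounds):
        model.fit(x_labeled, y_labeled)
        if len(x_unlabeled) == 0:
            break
        proba = model.predict_proba(x_unlabeled)
        confident = proba.max(axis=1) >= threshold
        if not confident.any():
            break  # nothing confident enough to pseudo-label
        # Treat confident predictions as labels and grow the training set.
        pseudo = model.classes_[proba[confident].argmax(axis=1)]
        x_labeled = np.vstack([x_labeled, x_unlabeled[confident]])
        y_labeled = np.concatenate([y_labeled, pseudo])
        x_unlabeled = x_unlabeled[~confident]
    return model
```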
semi-supervised learning
A type of machine learning that combines aspects of both supervised learning and unsupervised learning. In semi-supervised learning, the model is trained on a dataset that contains a relatively small amount of labeled data and a large amount of unlabeled data. The model learns from the labeled data to understand the prediction task and uses the unlabeled data to discover underlying patterns, structures, or the distribution of the data. Techniques like self-training and co-training fall under the umbrella of semi-supervised learning.
sensitive attribute
- responsible
An attribute that could potentially be used to unfairly discriminate against individuals or groups. Sensitive attributes often relate to protected characteristics such as race, ethnicity, gender, age, religion, disability, or sexual orientation. In machine learning, care must be taken to ensure that models do not make predictions or decisions that are unfairly biased based on these sensitive attributes, either directly or indirectly through proxies. Addressing bias related to sensitive attributes is a key focus of fairness in machine learning.
sentiment analysis
A machine learning task focused on determining the emotional tone or opinion expressed in a piece of text. Sentiment analysis models are trained to classify text (like product reviews, social media posts, or customer feedback) as expressing positive, negative, or neutral sentiment. This is often treated as a classification task.
sequence model
- fundamentals
A type of model designed to process or generate sequences of data. Sequence models can handle inputs where the order of the data points is significant, such as text, speech, time series data, or DNA sequences. Examples of sequence models include recurrent neural networks (RNNs), LSTMs, and Transformers. These models are capable of learning dependencies and patterns that extend over potentially long sequences.
sequence-to-sequence task
A machine learning task where the input to the model is a sequence of data and the output is also a sequence of data. In a sequence-to-sequence task, the input and output sequences may have different lengths. A classic example is machine translation, where an input sequence of words in one language is translated into an output sequence of words in another language. Other examples include text summarization and speech recognition. Sequence-to-sequence models often employ an encoder-decoder architecture.
serving
- fundamentals
The process of using a trained machine learning model to make predictions on new, unseen data in a production environment. Serving involves deploying the trained model and providing an interface through which it can receive input data and return predictions. This is synonymous with inference and scoring. Serving can be done in an online manner (real-time, on-demand predictions) or offline (batch predictions).
shape (Tensor)
- TensorFlow
In TensorFlow and other numerical libraries, the shape of a Tensor refers to the number of elements in each of its dimensions. It is represented as a list or tuple of integers. For example, a vector with 10 elements has a shape of (10,). A matrix with 3 rows and 4 columns has a shape of (3, 4). A Tensor with shape (2, 5, 7) is a 3-dimensional Tensor with 2 elements in the first dimension, 5 in the second, and 7 in the third. Understanding the shape of Tensors is crucial for performing operations and ensuring compatibility between different parts of a model.
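The same semantics apply in NumPy, which makes for a compact illustration:

```python
import numpy as np

vector = np.zeros(10)         # shape (10,):     rank 1
matrix = np.zeros((3, 4))     # shape (3, 4):    3 rows, 4 columns
tensor = np.zeros((2, 5, 7))  # shape (2, 5, 7): rank 3

print(vector.shape, matrix.shape, tensor.shape)
# (10,) (3, 4) (2, 5, 7)
```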
shard
- TensorFlow
To shard means to split a large dataset or model into smaller partitions or pieces. Sharding is commonly used in distributed training setups where the data or model is too large to fit into the memory of a single machine or device. The shards can then be processed in parallel across multiple hosts or devices.
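A minimal sketch of data sharding with NumPy, assuming three hypothetical workers:

```python
import numpy as np

data = np.arange(10)  # hypothetical dataset of 10 examples

# Split the dataset into 3 shards, one per worker; sizes may differ by one.
shards = np.array_split(data, 3)
print([s.tolist() for s in shards])
# [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]
```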
shrinkage
- regularization
In regularization techniques like L2 regularization (Ridge), shrinkage refers to the process of reducing the magnitude of the model's weights towards zero. Regularization penalties encourage smaller weights, effectively shrinking their values. Shrinkage helps to create simpler models and can reduce the impact of noise in the training data, thereby mitigating overfitting.
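A tiny numeric sketch of the effect: with the data gradient held at zero, the L2 penalty alone multiplies the weight by a constant factor below 1 on every step (the learning rate and lambda are arbitrary illustrative values):

```python
learning_rate, l2_lambda = 0.1, 0.5
weight, data_gradient = 2.0, 0.0  # zero data gradient isolates the shrinkage term

for step in range(3):
    # The L2 penalty lambda * w^2 contributes 2 * lambda * w to the gradient.
    weight -= learning_rate * (data_gradient + 2 * l2_lambda * weight)
    print(f"step {step}: weight = {weight:.3f}")
# The weight decays geometrically toward zero: 1.800, 1.620, 1.458
```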
sigmoid function
- fundamentals
- activation
The sigmoid function is a mathematical function commonly used as an activation function in neural networks, particularly in the output layer for binary classification tasks. The function is defined as: $$ f(x) = \frac{1}{1 + e^{-x}} $$ The sigmoid function takes any real-valued input and squashes it to an output value between 0 and 1. This makes it suitable for outputting probabilities. However, it suffers from the vanishing gradient problem for very large or very small input values, which can hinder the training of deep networks. It is also known as the logistic function.
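A direct NumPy transcription of the formula:

```python
import numpy as np

def sigmoid(x):
    """Logistic function: squashes any real input into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

print(sigmoid(np.array([-10.0, 0.0, 10.0])))
# [4.54e-05  0.5  0.99995...] -- the flat tails are where gradients vanish
```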
similarity measure
- fundamentals
A function or metric used to quantify how alike two data points, vectors, or sequences are. A higher similarity measure indicates greater resemblance. Common similarity measures include cosine similarity (which measures the cosine of the angle between two vectors) and Euclidean distance (though this is a distance measure, its inverse or negation can represent similarity). Similarity measures are fundamental in various machine learning tasks, such as clustering, recommendation systems, and one-shot learning.
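As a sketch, cosine similarity in NumPy (vectors pointing the same way score 1, orthogonal vectors score 0):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(np.array([1.0, 2.0]), np.array([2.0, 4.0])))  # 1.0 (parallel)
print(cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 1.0])))  # 0.0 (orthogonal)
```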
single program / multiple data (SPMD)
- TensorFlow
Single Program, Multiple Data (SPMD) is a parallel computing architecture where multiple processors simultaneously execute the same program on different pieces of data. In machine learning, the SPMD model is often used in distributed [[Machine_Learning_Glossary/T#training|training]], where copies of the model and training algorithm run on multiple devices (like GPUs or TPUs), each processing a different subset of the data. JAX functions like pmap and pjit are designed to facilitate SPMD programming.
size invariance
- vision
An attribute of an image model that means the model can successfully classify images even if the size of the image changes. For example, a model trained to recognize a cat should ideally still recognize the cat regardless of how large or small the cat appears in the image. Achieving true size invariance is a goal in computer vision, though perfect invariance is difficult.
sketching
A technique in machine learning used to create a compact summary or representation of a larger dataset. Sketching involves using probabilistic or deterministic methods to reduce the dimensionality or size of the data while preserving important properties. This is particularly useful for handling massive datasets that do not fit into memory, enabling more efficient computation and analysis.
skip-gram
- language
A model architecture used in natural language processing, particularly in algorithms like Word2Vec, for learning word embeddings. The skip-gram model is trained to predict the surrounding context words within a defined window, given a target word. For example, given the word "cat", the skip-gram model would try to predict words like "the", "sat", "on", and "mat" if they appear within the context window. This contrasts with the Continuous Bag of Words (CBOW) model, which predicts the target word from its context. By predicting context words from the target word, the skip-gram model learns word embeddings where words that appear in similar contexts are mapped to similar points in the embedding space.
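A sketch of how (target, context) training pairs are generated from a sentence; the window size of 2 is an illustrative choice:

```python
def skip_gram_pairs(tokens, window=2):
    """Yield (target, context) pairs within the given window size."""
    for i, target in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                yield target, tokens[j]

sentence = ["the", "cat", "sat", "on", "the", "mat"]
print(list(skip_gram_pairs(sentence))[:4])
# [('the', 'cat'), ('the', 'sat'), ('cat', 'the'), ('cat', 'sat')]
```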
softmax
- fundamentals
- activation
The softmax function is a mathematical function commonly used as an activation function in the output layer of neural networks for multi-class classification tasks. The softmax function takes a vector of real-valued scores (called logits) and transforms them into a probability distribution over the possible classes. The output values are all between 0 and 1, and they sum up to 1. Each output value represents the predicted probability that the input instance belongs to the corresponding class. The formula for the softmax function for a vector $z = [z_1, z_2, \dots, z_K]$ is: $$ \text{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}} $$ where $K$ is the number of classes.
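A direct NumPy transcription; subtracting the maximum logit before exponentiating is a standard numerical-stability trick and leaves the result unchanged:

```python
import numpy as np

def softmax(z):
    """Turn a vector of logits into a probability distribution."""
    exp_z = np.exp(z - np.max(z))  # stability shift; cancels in the ratio
    return exp_z / exp_z.sum()

logits = np.array([2.0, 1.0, 0.1])  # hypothetical raw scores for 3 classes
probs = softmax(logits)
print(probs, probs.sum())  # [0.659 0.242 0.099] 1.0
```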
soft prompt tuning
Synonym for prompt tuning.
sparse feature
- fundamentals
A feature where the values are predominantly zero or empty for most data examples. Sparse features are common when dealing with categorical data that has been one-hot encoded or with data involving counts of rare events. For example, in a bag of words representation of text, most words will be absent from any given document, resulting in a sparse feature vector. Handling sparse features efficiently is important for both memory usage and computational speed, often requiring specialized data structures and algorithms designed for sparsity.
sparse representation
- fundamentals
A method for storing data where only the non-zero elements are explicitly stored, along with their indices. This is in contrast to a dense representation where all elements, including zeros, are stored. Sparse representation is used to efficiently store and manipulate sparse features or sparse vectors and matrices, significantly reducing memory usage and potentially speeding up computations by avoiding operations on zero values.
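A minimal sketch using a plain dict of (index, value) pairs as the sparse format:

```python
# A dense vector and one possible sparse representation of it.
dense = [0.0, 0.0, 3.0, 0.0, 0.0, 0.0, 7.0, 0.0]

# Store only the non-zero elements, keyed by index.
sparse = {i: v for i, v in enumerate(dense) if v != 0.0}
print(sparse)  # {2: 3.0, 6: 7.0} -- two entries instead of eight

# A dot product only needs to touch the stored non-zeros.
other = [1.0] * 8
print(sum(v * other[i] for i, v in sparse.items()))  # 10.0
```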
sparse vector
- fundamentals
A vector in which the vast majority of elements have a value of zero. Sparse vectors are a common outcome of techniques like one-hot encoding for categorical data with a large number of categories, or in representations like bag of words. Efficiently storing and processing sparse vectors requires specialized data structures and algorithms that take advantage of the sparsity.
sparsity
- fundamentals
A characteristic of data, features, or model parameters where a large proportion of the values are zero. Sparsity is common in various types of data, such as text data (most words are absent from a given document) or categorical features with high cardinality (resulting in sparse one-hot encoded vectors). In models, sparsity can refer to a large number of model weights being zero, which can be encouraged through techniques like L1 regularization. Leveraging sparsity through sparse representations and specialized algorithms is crucial for efficient storage and computation.
spatial pooling
- vision
In convolutional neural networks, spatial pooling is a technique used to reduce the spatial dimensions (width and height) of the feature maps generated by convolutional layers. This is synonymous with pooling. Spatial pooling operations summarize regions of the input feature map into a single output value, reducing the overall size and computational cost of the network.
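A minimal NumPy sketch of non-overlapping 2x2 max pooling (assumes even height and width):

```python
import numpy as np

def max_pool_2x2(feature_map):
    """Summarize each 2x2 region of the feature map by its maximum."""
    h, w = feature_map.shape
    return feature_map.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

fmap = np.arange(16).reshape(4, 4)  # hypothetical 4x4 feature map
print(max_pool_2x2(fmap))
# [[ 5  7]
#  [13 15]] -- spatial dimensions halved from 4x4 to 2x2
```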
split
- decisionForests
A synonym for condition in a decision tree. A split is the decision rule applied at a node that partitions the data into two or more subsets based on the value of a feature or a combination of features.
splitter
- decisionForests
In decision tree training, the splitter is the component or algorithm responsible for finding the best split (condition) at each node. The splitter evaluates different possible splits based on criteria like information gain or gini impurity to determine the split that best separates the data according to the target labels.
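A sketch of one such criterion: Gini impurity, and the weighted split score a splitter would minimize over candidate splits (the label lists are hypothetical):

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity: probability of mislabeling a random draw."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def split_score(left, right):
    """Weighted impurity of a candidate split; lower is better."""
    n = len(left) + len(right)
    return (len(left) * gini_impurity(left) + len(right) * gini_impurity(right)) / n

print(split_score(["a", "a", "a"], ["b", "b"]))  # 0.0: perfect separation
print(split_score(["a", "b"], ["a", "b"]))       # 0.5: uninformative split
```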
SPMD
Abbreviation for single program / multiple data.
squared hinge loss
- Metric
A variation of the hinge loss function used in machine learning, particularly for classification tasks. The squared hinge loss is defined as: $$ \text{Squared Hinge Loss} = \max(0, 1 - y \cdot \hat{y})^2 $$ where $y$ is the true label (-1 or 1 for binary classification) and $\hat{y}$ is the model's raw output (before applying a sign function). Like standard hinge loss, it penalizes predictions that are on the wrong side of the decision boundary or not confident enough on the correct side. The squaring of the term $\max(0, 1 - y \cdot \hat{y})$ makes the penalty grow quadratically with the error, making it more sensitive to large errors than the linear hinge loss.
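A direct NumPy transcription of the formula (the example predictions are hypothetical):

```python
import numpy as np

def squared_hinge_loss(y_true, y_pred):
    """y_true in {-1, +1}; y_pred is the raw, unthresholded model output."""
    return np.maximum(0.0, 1.0 - y_true * y_pred) ** 2

y_true = np.array([1, 1, -1])
y_pred = np.array([2.0, 0.5, 0.5])  # confident-correct, weakly correct, wrong side
print(squared_hinge_loss(y_true, y_pred))  # [0.   0.25 2.25]
```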
squared loss
- Metric
A loss function that penalizes predictions based on the square of the difference between the predicted value and the actual value. Squared loss is synonymous with L2 loss and Mean Squared Error (MSE). It is a widely used loss function in regression problems.
staged training
A technique used in training models, particularly large or complex ones, where the training process is divided into multiple consecutive stages. In staged training, different parts of the model might be trained or fine-tuned in separate stages, or the learning rate or other hyperparameters might be adjusted across stages. For example, the RAG architecture is often trained in stages, first training the retriever and then training the generator while keeping the retriever frozen. This allows for more focused optimization of different components or aspects of the model. See also pipelining.
state
- reinforcementLearning
In reinforcement learning, the state ($s$) represents the current configuration or situation of the environment that the agent is in. The state provides the agent with the necessary information to decide which action to take. The state can be represented in various ways, depending on the nature of the environment, such as the position of pieces on a chessboard, the sensor readings from a robot, or the current frame in a video game.
state-action value function
See Q-function. The state-action value function is another name for the Q-function, which estimates the expected future return of taking a specific action in a given state and then following an optimal policy.
static
- fundamentals
Synonym for offline. In machine learning, static typically refers to processes or data that are fixed or unchanging over time, as opposed to dynamic processes or streaming data.
static inference
- fundamentals
Synonym for offline inference. Static inference involves generating a batch of predictions at once and caching them, rather than making predictions on demand.
stationarity
- fundamentals
A property of a distribution where the statistical properties (such as mean and variance) do not change over time or location. In machine learning, stationarity is often assumed for data, particularly in time series analysis, but real-world data frequently exhibits nonstationarity, where the data distribution changes over time. Models trained on stationary data may perform poorly on nonstationary data unless they are specifically designed or adapted to handle such shifts.
step
- TensorFlow
In TensorFlow, a step (also known as a training step or global step) refers to a single iteration of the training loop. During each step, a mini-batch of data is typically processed, the loss is computed, the gradients are calculated, and the model's parameters are updated by the optimizer. The total number of training steps is a common way to measure the progress of training.
step size
Synonym for learning rate. In gradient descent and other optimization algorithms, the step size determines how far the parameters move in the direction of the negative gradient at each training step.
stochastic gradient descent (SGD)
- optimization
Stochastic Gradient Descent (SGD) is a widely used optimization algorithm for training machine learning models, particularly neural networks. Unlike standard Gradient Descent which computes the gradient of the loss function over the entire training dataset before making a parameter update, SGD computes the gradient and updates the parameters for each individual training example at a time (or more commonly, on mini-batches of examples in mini-batch SGD). This stochastic (random) approach introduces noise into the gradient estimates but is much more computationally efficient for large datasets and can help escape shallow local minima.
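A minimal sketch fitting a one-parameter linear model with per-example updates (the synthetic data and learning rate are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 3.0 * x + rng.normal(scale=0.1, size=100)  # true weight is 3.0

w, learning_rate = 0.0, 0.05
for epoch in range(5):
    for i in rng.permutation(len(x)):      # shuffle, then one example per update
        error = w * x[i] - y[i]
        w -= learning_rate * error * x[i]  # gradient of 0.5 * (w*x - y)^2
print(round(w, 2))  # close to 3.0
```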
stride
- fundamentals
In convolutional neural networks and pooling operations, the stride refers to the number of pixels that the convolution filter or the pooling window shifts over the input matrix or Tensor at each step. A stride of 1 means the filter or window moves one pixel at a time. A stride of 2 means it moves two pixels at a time, effectively downsampling the spatial dimensions of the output.
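The stride's effect on output size follows a simple formula, sketched below (padding defaults to 0 here):

```python
def conv_output_size(input_size, filter_size, stride, padding=0):
    """Spatial output size for a convolution or pooling window."""
    return (input_size + 2 * padding - filter_size) // stride + 1

print(conv_output_size(32, 3, stride=1))  # 30: the window slides one pixel at a time
print(conv_output_size(32, 3, stride=2))  # 15: stride 2 roughly halves the dimension
```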
structural risk minimization (SRM)
A principle in machine learning that aims to find a balance between the complexity of a model and its ability to fit the training data. Structural Risk Minimization (SRM) seeks to minimize an upper bound on the expected error (risk) on unseen data, rather than just minimizing the error on the training data (empirical risk minimization). This involves selecting a model from a nested set of function classes, where each class represents a different level of complexity, and choosing the class that minimizes the sum of the empirical risk and a term that penalizes complexity. SRM is related to regularization and the concept of the VC dimension.
subsampling
Synonym for downsampling. Subsampling typically refers to reducing the size or resolution of data, such as images or audio signals, or reducing the number of examples in a dataset. In convolutional neural networks, pooling is sometimes informally referred to as subsampling.
subword token
- language
In natural language processing, a subword token is a unit of text that is smaller than a complete word but can be combined with other subword tokens to form words. Examples include morphemes (like "un-" or "-ing") or frequently occurring character sequences. Using subword tokens (such as those generated by algorithms like WordPiece or Byte Pair Encoding) allows language models to handle out-of-vocabulary words by breaking them down into known subword units. This is particularly useful for languages with rich morphology or for handling misspellings and rare words.
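A simplified, WordPiece-style sketch of greedy longest-match-first subword splitting (the vocabulary here is hypothetical; real tokenizers learn theirs from data):

```python
def greedy_subword_tokenize(word, vocab):
    """Split a word into the longest known subwords, left to right."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            return [word]  # fall back: no known subword matches here
        pieces.append(word[start:end])
        start = end
    return pieces

vocab = {"un", "happi", "ness", "run", "ing"}  # hypothetical subword vocabulary
print(greedy_subword_tokenize("unhappiness", vocab))  # ['un', 'happi', 'ness']
```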
summary
A concise representation of a larger piece of content, such as a text document or an image. In machine learning, a summary can refer to:
- A condensed textual representation of a longer document, a common task in natural language processing.
- A statistic that provides a concise description of a set of data, such as the mean, median, or standard deviation.
supervised machine learning
- fundamentals
A major paradigm in machine learning where a model learns from a dataset containing labeled examples. In supervised machine learning, each example in the training dataset is paired with the correct output or label. The model's objective is to learn a mapping from the input features to the correct labels. This learned mapping can then be used to make predictions on new, unseen data. Common supervised learning tasks include classification and regression.
synthetic feature
- featureEngineering
A feature created by combining or transforming one or more existing features. Synthetic features are part of the feature engineering process and can help models learn more complex patterns than would be possible with the original features alone. Examples include creating a feature cross by multiplying two features, or combining features through polynomial transformations.