
Machine Learning Glossary/P

From VELEVO®.WIKI

Written by Sebastian F. Genter


P

packed data

  1. fundamentals

Packed data refers to an approach for storing data in a more efficient manner. This involves storing data either through the use of a compressed format or by organizing it in some other way that allows for more efficient access. The primary goal of using packed data is to minimize the amount of memory and computational resources required to access the data, which in turn leads to faster training and more efficient model inference. Packed data is often employed in conjunction with other techniques, such as data augmentation and regularization, to further enhance the performance of models.

pandas

A widely used Python library that provides a column-oriented data analysis API. It is built upon the NumPy library. Many machine learning frameworks, including TensorFlow, offer support for using pandas data structures as inputs to models. For detailed information, consult the pandas documentation.
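
As a minimal sketch of column-oriented analysis (assuming pandas is installed; the column names here are invented for illustration):

```python
import pandas as pd

# Build a small column-oriented table (a DataFrame).
df = pd.DataFrame({
    "city": ["Berlin", "Munich", "Berlin"],
    "temp_c": [18.0, 21.0, 20.0],
})

# Column-wise operations: select a column and aggregate per group.
mean_temp = df.groupby("city")["temp_c"].mean()
```

A DataFrame like this can then be handed to frameworks such as TensorFlow through their dataset utilities.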

parameter

Parameters are the weights and biases that a model iteratively learns and adjusts during the training process. For instance, in a linear regression model, the parameters include the bias term ($b$) and all the individual weights ($w_1, w_2,$ and so on) as represented in the model's formula. These are distinct from hyperparameters, which are values that you (or a hyperparameter tuning service) explicitly set and provide to the model before or during training. An example of a hyperparameter is the learning rate.
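
To make the distinction concrete, here is a toy sketch (the data and learning rate are invented for illustration): the parameters `w` and `b` are learned by gradient descent, while the learning rate stays a hand-set hyperparameter.

```python
# Toy linear regression on data generated by y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

w, b = 0.0, 0.0        # parameters: learned during training
learning_rate = 0.05   # hyperparameter: set by hand before training

for _ in range(2000):
    # Gradients of the mean squared error with respect to w and b.
    grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / len(xs)
    grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / len(xs)
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b
```

After training, `w` and `b` approach 2 and 1, the values used to generate the data.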

parameter-efficient tuning

A collection of techniques designed to fine-tune a large pre-trained language model (PLM) in a more computationally efficient manner compared to performing full fine-tuning. Parameter-efficient tuning methods typically involve fine-tuning a significantly smaller number of parameters than required for full fine-tuning. Despite adjusting fewer parameters, these techniques generally result in a large language model that achieves comparable performance (or nearly comparable) to one obtained through full fine-tuning. This approach can be compared and contrasted with instruction tuning and prompt tuning. Parameter-efficient tuning is also known as parameter-efficient fine-tuning.

Parameter Server (PS)

A role or job within a distributed machine learning setting that is responsible for keeping track of and managing the parameters of a model. Parameter Servers are crucial in distributed training environments where the model or data is sharded across multiple machines or devices.

parameter update

The procedural operation of adjusting the parameters (the weights and biases) of a model during the training process. These adjustments typically occur within each single iteration of a gradient descent algorithm, based on the computed gradients of the loss function.

partial derivative

In calculus, a partial derivative is the derivative of a function with respect to one of its variables, treating all other variables as constants. For a function $f(x, y)$ with variables $x$ and $y$, the partial derivative with respect to $x$ focuses solely on how the function $f$ changes as $x$ changes, while $y$ is held constant. In machine learning, partial derivatives are fundamental to gradient descent and backpropagation algorithms, where they are calculated to determine how changes in each individual parameter (weights and biases) affect the loss function.
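
The definition can be checked numerically. The sketch below (the function is chosen for illustration) approximates $\partial f/\partial x$ with a central difference while holding $y$ constant:

```python
def f(x, y):
    return x ** 2 * y + y ** 3

def partial_x(f, x, y, h=1e-6):
    # Central-difference approximation of df/dx with y held constant.
    return (f(x + h, y) - f(x - h, y)) / (2 * h)
```

Analytically $\partial f/\partial x = 2xy$, so at $(3, 2)$ the approximation should be close to 12.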

participation bias

Synonym for non-response bias. This is a type of bias that occurs when the participants in a dataset are not representative of the larger population that the model will be applied to, often because certain groups are less likely to participate or be included in the data collection process. See selection bias.

partitioning strategy

The specific algorithm or method used to divide variables (such as parameters) and/or data across multiple parameter servers or devices in a distributed machine learning setup. This strategy determines how the workload and data are distributed to enable parallel processing.

pass at k (pass@k)

A metric specifically used to evaluate the quality of generated code (such as Python code) produced by a large language model. Pass at k measures the likelihood that at least one correct and functional block of code will be found among the first $k$ code blocks generated by the LLM when attempting to solve a coding problem. This is typically determined by running automated unit tests against the generated code. If one or more of the $k$ generated solutions pass the unit tests for a given coding challenge, the LLM is considered to have "Passed" that challenge. If none of the $k$ solutions pass, it "Fails" the challenge. The formula for pass at k is the total number of challenges passed divided by the total number of challenges attempted. Higher values of $k$ generally lead to higher pass at k scores, as the model has more attempts to produce a correct solution, but this also requires more computational resources.
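
Under the definition above, a scorer might be sketched like this (the data layout — one list of unit-test results per challenge — is an assumption for illustration):

```python
def pass_at_k(unit_test_results, k):
    # unit_test_results: one list of booleans per coding challenge, where each
    # boolean records whether a generated solution passed the unit tests.
    # A challenge counts as "passed" if any of its first k solutions pass.
    passed = sum(1 for results in unit_test_results if any(results[:k]))
    return passed / len(unit_test_results)
```

For example, with three challenges of which two have at least one passing solution among the first $k$, the score is $2/3$.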

Pax

A programming framework designed specifically for the training of very large-scale neural network models that are so substantial they need to be spread across multiple TPU accelerator chip slices or pods. Pax is built upon the Flax library, which itself is built on JAX. It provides an infrastructure for constructing and training massive models efficiently in a distributed computing environment.

perceptron

A fundamental component in machine learning, conceptualized as a system (which can be implemented in hardware or software) that receives one or more input values. The perceptron computes a weighted sum of its inputs and then passes this sum through a function to produce a single output value. In the context of neural networks, this function is typically a nonlinear activation function, such as ReLU, sigmoid, or tanh. Perceptrons serve as the basic building blocks, or neurons, within neural networks.
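
A minimal sketch of a single perceptron with a ReLU activation (the weights and inputs are invented for illustration):

```python
def relu(z):
    return max(0.0, z)

def perceptron(inputs, weights, bias, activation=relu):
    # Weighted sum of the inputs plus the bias, passed through the activation.
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)
```

Swapping in a sigmoid or tanh for `activation` gives the other common variants mentioned above.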

performance

An overloaded term with two primary meanings in the context of software and machine learning. In standard software engineering, performance typically refers to how fast or efficiently a piece of software executes. In machine learning, however, performance specifically addresses the question of how correct or accurate a model's predictions are. It quantifies the quality and effectiveness of the model's output on a given task.

permutation variable importances

A type of variable importance metric used to assess the significance of individual features in a model's predictions. It works by evaluating how much the model's prediction error increases when the values of a specific feature are randomly shuffled or permuted. A larger increase in error indicates that the permuted feature was more important to the model's performance. Permutation variable importance is notable for being a model-independent metric, meaning it can be applied to any model regardless of its internal structure.
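
A bare-bones, model-independent sketch of the idea (the function names and data layout are assumptions for illustration): shuffle one feature column, re-score, and report the increase in error.

```python
import random

def permutation_importance(model, X, y, feature_idx, metric, seed=0):
    # Error on the intact data.
    baseline = metric(y, [model(row) for row in X])
    # Shuffle a single feature column, leaving everything else untouched.
    col = [row[feature_idx] for row in X]
    random.Random(seed).shuffle(col)
    X_perm = [row[:feature_idx] + [v] + row[feature_idx + 1:]
              for row, v in zip(X, col)]
    permuted = metric(y, [model(row) for row in X_perm])
    # Larger increase in error => the feature mattered more.
    return permuted - baseline
```

Because only inputs and outputs are touched, `model` can be any callable — no access to its internals is needed.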

perplexity

A metric used to measure how well a language model is performing its task, particularly in predicting sequences of tokens. Informally, for a task like predicting the next word a user is typing, perplexity ($P$) can be understood as an approximation of the number of possible next words the model would need to suggest to have a high probability of including the actual next word the user intends to type. Perplexity is mathematically related to cross-entropy.
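
Given the relationship to cross-entropy, perplexity can be computed as the exponential of the average negative log-likelihood; a minimal sketch:

```python
import math

def perplexity(token_probabilities):
    # token_probabilities: the probability the model assigned to each
    # actual token in the sequence. Perplexity is the exponential of the
    # average negative log-likelihood (i.e., of the cross-entropy).
    avg_nll = -sum(math.log(p) for p in token_probabilities) / len(token_probabilities)
    return math.exp(avg_nll)
```

For instance, if the model assigns probability 0.25 to every actual token, the perplexity is 4 — informally, the model is about as uncertain as a uniform choice among four candidates.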

pipeline

In the context of machine learning, a pipeline refers to the end-to-end infrastructure and sequence of steps involved in taking raw data and ultimately deploying a trained model for production use. A typical pipeline encompasses stages such as data collection, preprocessing and transforming data into a suitable format for training, training one or more models, evaluating the models, and finally exporting the trained models to a production environment for inference.

pipelining

A specific form of model parallelism where the processing of a single model is broken down into a series of consecutive stages. Each of these stages is then executed on a different device (such as a GPU or TPU core). The key feature of pipelining is that while one stage is actively processing a particular batch of data, the preceding stage in the pipeline can concurrently begin working on the next batch. This overlapping execution allows for increased throughput and efficiency in training or inference, especially for large models. See also staged training.

pjit

A function within the JAX library that is used to parallelize computation and split code execution across multiple accelerator chips. By passing a Python function to `pjit`, JAX compiles it into an XLA (Accelerated Linear Algebra) computation that can run in parallel on multiple devices (like GPUs or TPU cores). `pjit` facilitates sharding computations across devices without requiring significant code rewriting, often using the SPMD (Single Program, Multiple Data) partitioning strategy. As of March 2023, `pjit` functionality has been merged into the standard `jit` function in JAX.

PLM

An abbreviation for pre-trained language model. This term typically refers to a language model, especially a large language model, that has undergone an initial pre-training phase on a vast dataset.

pmap

A function within the JAX library that enables the execution of copies of an input function simultaneously on multiple underlying hardware devices (CPUs, GPUs, or TPUs), with each copy processing different input values. `pmap` relies on the SPMD (Single Program, Multiple Data) programming model, allowing for efficient parallel execution of functions across hardware devices.

policy

In the domain of reinforcement learning, a policy defines the behavior of an agent. It represents a probabilistic mapping from observed states of the environment to the actions that the agent might take in those states. The agent uses its policy to decide which action to take in any given situation within the environment, with the goal of maximizing its cumulative reward.

pooling

A technique used in convolutional neural networks to reduce the spatial dimensions (width and height) of the matrix or Tensors produced by earlier convolutional layers. Pooling operations typically involve dividing the input matrix into non-overlapping or overlapping regions and then taking either the maximum value (max pooling) or the average value (average pooling) within each region to create a smaller output matrix. This process helps to reduce the computational complexity and memory requirements of the network while also providing some degree of translational invariance to the learned features. Pooling for computer vision applications is also known as spatial pooling, while for time-series data it's called temporal pooling. Less formally, pooling is sometimes referred to as subsampling or downsampling.
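
A minimal sketch of 2×2 max pooling with stride 2 on a plain nested list (no framework assumed):

```python
def max_pool_2x2(matrix):
    # 2x2 max pooling with stride 2: each non-overlapping 2x2 region of the
    # input is replaced by its maximum value.
    out = []
    for i in range(0, len(matrix) - 1, 2):
        row = []
        for j in range(0, len(matrix[0]) - 1, 2):
            row.append(max(matrix[i][j], matrix[i][j + 1],
                           matrix[i + 1][j], matrix[i + 1][j + 1]))
        out.append(row)
    return out
```

Replacing `max` with an average over the four values would give average pooling instead.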

positional encoding

A technique used primarily in Transformer models to inject information about the absolute or relative position of each token within a sequence into the embedding of that token. Since Transformers process sequences in parallel without inherent sequential processing like recurrent neural networks, positional encoding is essential for the model to understand the order of tokens and the relationships between them based on their positions. A common implementation uses sinusoidal functions to generate these positional signals, which are then added to the token embeddings.
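
A minimal sketch of the sinusoidal scheme, following the common formulation $PE(pos, 2i) = \sin(pos / 10000^{2i/d})$ and $PE(pos, 2i+1) = \cos(pos / 10000^{2i/d})$:

```python
import math

def positional_encoding(position, d_model):
    # Sinusoidal positional encoding: even indices use sine, odd use cosine,
    # with wavelengths growing geometrically with the dimension index.
    pe = []
    for i in range(0, d_model, 2):
        angle = position / (10000 ** (i / d_model))
        pe.append(math.sin(angle))
        pe.append(math.cos(angle))
    return pe[:d_model]
```

The resulting vector has the same dimensionality as the token embedding, so the two can simply be added element-wise.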

positive class

In a binary classification problem, the positive class is the specific outcome or category that the model is designed to detect or identify. For example, in a medical test designed to detect cancer, "tumor" would be the positive class. In an email classification model, "spam" might be designated as the positive class. It's important to note that the term "positive" here doesn't necessarily imply a desirable outcome; it simply denotes the class of interest for the test. The positive class is contrasted with the negative class.

post-processing

  1. responsible
  2. fundamentals

The action of adjusting or modifying the output of a model after the model has completed its prediction task. Post-processing is often employed to enforce certain criteria or constraints on the model's output, particularly to address issues related to fairness without needing to alter the model's internal structure or training process. For example, in a binary classification scenario, post-processing could involve adjusting the classification threshold applied to the model's raw scores to ensure that a fairness metric, such as equality of opportunity, is satisfied across different sensitive attributes.

post-trained model

  1. language
  2. image
  3. generativeAI

A term that is loosely defined but typically refers to a pre-trained model that has undergone one or more subsequent training or adjustment steps beyond its initial pre-training phase. These post-training steps can include techniques such as Distillation, Fine-tuning, or Instruction tuning. The purpose of these steps is usually to adapt or refine the pre-trained model for specific downstream tasks or improve certain aspects of its performance or behavior.

PR AUC (area under the PR curve)

  1. Metric

PR AUC is a metric that represents the area under the interpolated precision-recall curve. The precision-recall curve is created by plotting recall values on the x-axis against corresponding precision values on the y-axis, using different settings for the classification threshold in a binary classification task. PR AUC provides a single scalar value that summarizes the performance of a binary classification model across all possible thresholds, particularly useful for datasets with class imbalance.

Praxis

A core and high-performance library that is part of the Pax framework. Praxis is frequently referred to as the "Layer library" because it contains the fundamental definitions and components for building layers in neural networks. Beyond just layer definitions, Praxis also includes essential supporting components such as data inputs, configuration libraries (like HParam and Fiddle), and optimizers. Praxis also provides the definitions for the fundamental Model class within Pax.

precision

  1. Metric

Precision is a metric used to evaluate the performance of classification models, particularly in binary classification. It answers the question: "When the model predicted the positive class, what percentage of those predictions were actually correct?". The formula for Precision is: $$ \text{Precision} = \frac{\text{true positives}}{\text{true positives} + \text{false positives}} $$ where true positives are positive predictions that were actually positive, and false positives are positive predictions that were actually negative.

For example, if a model made 200 positive predictions, and 150 of them were true positives while 50 were false positives, the precision would be $150 / (150 + 50) = 150 / 200 = 0.75$. Precision is often considered alongside accuracy and recall.
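
The worked example above is a one-liner in code:

```python
def precision(true_positives, false_positives):
    # Fraction of positive predictions that were actually correct.
    return true_positives / (true_positives + false_positives)

result = precision(150, 50)  # 150 / 200 = 0.75
```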

precision at k (precision@k)

  1. Metric

Precision at k is a metric specifically used for evaluating the quality of ranked (ordered) lists of items, often generated by systems like recommendation systems or ranking models. It measures the fraction of the first $k$ items in the ranked list that are considered "relevant". The formula is: $$ \text{Precision@k} = \frac{\text{Number of relevant items in the first k items}}{\text{k}} $$ The value of $k$ must be less than or equal to the total length of the generated list. Relevance is often a subjective judgment, which can sometimes lead to disagreements even among human evaluators. Precision at k is a common metric used in conjunction with average precision at k and mean average precision at k.
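
Assuming relevance judgments are already available as booleans (one per ranked item), the metric reduces to a short sketch:

```python
def precision_at_k(relevance, k):
    # relevance: one boolean per item in the ranked list, best-ranked first.
    # k must not exceed the length of the list.
    assert k <= len(relevance)
    return sum(relevance[:k]) / k
```

For example, if two of the top three items are relevant, precision at 3 is $2/3$.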

precision-recall curve

A graphical representation used to evaluate the performance of binary classification models. The precision-recall curve plots precision values on the y-axis against corresponding recall values on the x-axis, across different settings of the classification threshold. This curve is particularly informative for evaluating models on datasets with class imbalance, as it focuses on the trade-off between correctly identifying positive examples (recall) and avoiding false positives (precision).

prediction

The output generated by a model after processing input examples. The nature of the prediction depends on the type of model and the task it is designed for. For example, a classification model predicts a class (or a probability for each class), while a regression model predicts a numeric value.

prediction bias

A metric used to quantify how much the average of a model's predictions deviates from the average of the actual label values in a dataset. Prediction bias indicates a systematic tendency for the model to either over-predict or under-predict compared to the true values. This term is distinct from the bias term within a model's formula and from the concept of bias in ethics and fairness.
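
The computation itself is simply the average prediction minus the average label (a sketch; the sample values are invented for illustration):

```python
def prediction_bias(predictions, labels):
    # Average prediction minus average label; a nonzero value signals a
    # systematic tendency to over-predict (positive) or under-predict (negative).
    return sum(predictions) / len(predictions) - sum(labels) / len(labels)
```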

predictive ML

A loosely defined term typically used to refer to standard or "classic" machine learning systems. The term predictive ML serves to distinguish this category of ML systems, which primarily focus on making predictions based on input data, from newer systems based on generative AI which focus on creating new content.

predictive parity

  1. responsible
  2. Metric

A fairness metric used to assess whether a classifier model is making predictions with equivalent precision rates across different subgroups of a population. Predictive parity is satisfied if the precision score is the same for all subgroups under consideration, meaning that when the model predicts the positive class, it is correct equally often for individuals in each subgroup. For example, a model predicting college acceptance would satisfy predictive parity with respect to nationality if its precision rate for predicting acceptance is the same for applicants of different nationalities. Predictive parity is also sometimes referred to as predictive rate parity.

predictive rate parity

Another term used as a synonym for predictive parity.

preprocessing

  1. fundamentals

The essential stage of processing raw data before it is used to train a model. Preprocessing can involve a wide range of transformations and cleaning steps, from relatively simple tasks like removing irrelevant words from a text corpus to more complex operations. These complex steps can include re-expressing data points in ways that aim to reduce or eliminate attributes that are correlated with sensitive attributes. Effective preprocessing is crucial for preparing data in a format suitable for the model and can play a significant role in improving model performance, generalization, and helping to satisfy fairness constraints.

pre-trained model

  1. language
  2. image
  3. generativeAI

A model that has already undergone an initial phase of training. The term pre-trained model can refer to a previously trained model architecture with learned parameters, or sometimes to a previously trained embedding vector. When specifically referring to language models, the term pre-trained language model typically denotes a large language model that has completed its initial pre-training on a massive text dataset.

pre-training

  1. language
  2. image
  3. generativeAI

The initial and often most computationally intensive training phase for a model, performed on a very large and general dataset. The goal of pre-training is to learn a broad range of fundamental features, patterns, and representations from the data. Some pre-trained models resulting from this phase may be somewhat "clumsy" or not directly optimized for a specific downstream task, and therefore typically require further refinement through additional training steps. For example, a large language model might be pre-trained on an enormous text corpus like all the English pages on Wikipedia. Following pre-training, the resulting pre-trained model can be further refined using techniques such as distillation, fine-tuning, instruction tuning, parameter-efficient tuning, or prompt tuning.

prior belief

In a Bayesian context, a prior belief represents the assumptions or knowledge that one has about the data or model parameters before commencing the training process. This prior knowledge can influence the learning process. For instance, L2 regularization incorporates a prior belief that the weights of a model should ideally be small and clustered around zero, effectively penalizing larger weights during training.

probabilistic regression model

A type of regression model that goes beyond simply predicting a single numerical value. A probabilistic regression model incorporates an understanding of the uncertainty associated with its parameters (the weights) and thus provides a prediction that includes a measure of this uncertainty. For example, instead of predicting a house price of exactly 325,000 Euros, a probabilistic regression model might predict a price of 325,000 Euros with a standard deviation of 12,000 Euros, indicating the estimated range of variability or uncertainty around the prediction.

probability density function

A function used in machine learning and statistics, particularly for distributions of continuous numerical data. A probability density function (PDF) describes the likelihood of a continuous random variable taking on a specific value. While the probability of a continuous variable equaling any single exact value is theoretically zero, the PDF's significance lies in the fact that the integral of the function over a given range of values (from $x$ to $y$) provides the expected frequency or probability of data samples falling within that range. For example, for a normal distribution, integrating the PDF over a certain interval can tell you the proportion of data points expected to fall within that interval.
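
The "integral over a range" interpretation can be checked numerically. The sketch below (helper names invented for illustration) integrates the standard normal PDF over $[-1, 1]$ with a simple midpoint rule:

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    # Probability density function of the normal distribution.
    coeff = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return coeff * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))

def integrate(f, a, b, n=10_000):
    # Midpoint-rule approximation of the integral of f over [a, b].
    h = (b - a) / n
    return sum(f(a + (i + 0.5) * h) for i in range(n)) * h
```

Integrating over $[-1, 1]$ should come out near 0.683 — the familiar probability of falling within one standard deviation of the mean.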

prompt

In the context of large language models, a prompt is any text input provided by a user to condition the model and guide its behavior or the nature of its response. Prompts can vary significantly in length, from a short phrase to arbitrarily long texts like an entire novel. Prompts are diverse and can fall into various categories, including questions, instructions asking the LLM to perform a task, examples demonstrating the desired output format, role-playing instructions, or partial inputs for the model to complete. Generative AI models are capable of responding to prompts with a wide range of outputs, including text, code, images, or even embeddings.

prompt-based learning

A capability exhibited by certain models, predominantly large language models, that allows them to adapt their behavior and generate responses based directly on arbitrary text input provided as a prompt. In the prompt-based learning paradigm, the LLM is not explicitly trained on a vast dataset for every conceivable task. Instead, it leverages the broad knowledge acquired during its pre-training and potentially fine-tuning to understand the instructions or context provided in the prompt and generate a relevant and useful response. The effectiveness of prompt-based learning can often be enhanced through techniques like human feedback.

prompt design

Synonym for prompt engineering.

prompt engineering

The iterative and skillful process of creating and refining prompts to effectively elicit the desired responses or behaviors from a large language model. Prompt engineering is typically performed by humans and is considered an essential skill for effectively utilizing LLMs to achieve specific outcomes. The success of prompt engineering is influenced by various factors, including the characteristics of the datasets used to pre-train and potentially fine-tune the LLM, as well as the decoding parameters (like temperature) used by the model to generate its responses. Prompt design is another term used for prompt engineering.

prompt tuning

A parameter-efficient tuning technique for adapting large language models to specific tasks. In prompt tuning, instead of modifying all the parameters of the pre-trained model, a small set of trainable parameters, often referred to as a "prefix" or "soft prompt," is learned and prepended to the actual input prompt during training. This prefix is learned for the specific task while the rest of the pre-trained model's parameters remain frozen. A variation called prefix tuning prepends this learned prefix at every layer of the model, whereas most prompt tuning methods only add it to the input layer.

proxy (sensitive attributes)

An attribute that is used as an indirect stand-in or substitute for a sensitive attribute. Proxies are often correlated with sensitive attributes but are not the sensitive attribute itself. For example, an individual's postal code might function as a proxy for their income level, race, or ethnicity because these demographic characteristics can be correlated with geographical location. The use of proxies can inadvertently introduce or perpetuate bias in models, even if the sensitive attribute itself is not directly used as a feature.

proxy labels

Proxy labels are data points or values that are used to approximate the true labels for a task when the actual labels are not directly available within a dataset. For instance, if you need to train a model to predict employee stress levels but your dataset lacks a direct "stress level" label, you might choose "workplace accidents" as a proxy label, assuming that higher stress correlates with more accidents. However, proxy labels are often imperfect representations of the true concept they are intended to approximate. When actual labels are unavailable, the selection of a proxy label should be made very carefully, aiming for the least imperfect candidate.

pure function

In programming, a pure function is a function whose output is solely determined by its input values, and which does not produce any observable side effects. This means that calling a pure function with the same input arguments will always yield the same output, regardless of the function's execution context or any external state. Additionally, a pure function does not modify any state outside of its own scope, such as changing the value of a global variable or writing to a file. Pure functions are valuable in parallel and distributed computing because they can be executed independently and concurrently without concerns about race conditions or shared state issues. Frameworks like JAX require that the functions used with their transformation methods are pure functions.
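
A minimal contrast between an impure and a pure function:

```python
counter = 0

def impure_add(x):
    # Impure: reads and mutates global state, so identical calls can
    # return different results.
    global counter
    counter += x
    return counter

def pure_add(total, x):
    # Pure: the result depends only on the arguments; no side effects.
    return total + x
```

Calling `pure_add(1, 2)` always returns 3, whereas `impure_add(2)` returns a different value on each call — exactly the property that makes impure functions unsafe to parallelize.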