Metrics

Poutyne offers two kinds of metrics: batch and epoch metrics. Batch metrics are computed at each batch, whereas epoch metrics accumulate statistics at each batch and compute the metric only at the end of the epoch. Batch metrics are passed to Model and ModelBundle.from_network() using the batch_metrics argument, and epoch metrics using the epoch_metrics argument.

In addition to the predefined metrics below, any PyTorch loss function can be used as a metric by passing its functional name as a string. The key associated with it in callback logs is its name without the _loss suffix. For example, the loss function mse_loss() can be passed as a metric with the name 'mse_loss' or simply 'mse', and the keys will be 'mse' and 'val_mse' for the training and validation MSE, respectively. Note that the PyTorch loss functions can also be passed as the loss function of Model in the same way.
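For instance, here is a minimal sketch, assuming network is a regression network defined elsewhere, in which the L1 loss is optimized while the MSE is tracked as a batch metric:

from poutyne import Model

model = Model(
    network,
    'sgd',
    'l1_loss',              # PyTorch loss passed by its functional name
    batch_metrics=['mse'],  # logged under the keys 'mse' and 'val_mse'
)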

Warning

When using the batch_metrics argument, the metrics are computed at each batch. Depending on the metrics used, this can significantly slow down the computation. This mostly happens with non-decomposable metrics, such as torchmetrics.AUROC, for which an ordering of the elements is necessary to compute the metric. In such cases, we advise using them as epoch metrics instead.

Here is an example using metrics:

from poutyne import Model, Accuracy, F1
import torchmetrics

model = Model(
    network,
    'sgd',
    'cross_entropy',

    batch_metrics=[Accuracy(), F1()],
    # Can also use a string in this case:
    # batch_metrics=['accuracy', 'f1'],

    epoch_metrics=[torchmetrics.AUROC(num_classes=10, task="multiclass")],
)
model.fit_dataset(train_dataset, valid_dataset)

Interface

There are two interfaces available for metrics. The first interface is the same as that of PyTorch loss functions: metric(y_pred, y_true). When using this interface, the metric is assumed to be decomposable and is averaged over the whole epoch. The batch size is inferred with poutyne.get_batch_size() using y_pred and y_true as values.
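Here is a sketch of a metric following this first interface; the function below is user-defined, not a Poutyne built-in:

import torch

# A plain callable taking (y_pred, y_true) and returning a scalar for the
# current batch. Poutyne averages these values over the epoch, weighted by
# the inferred batch size.
def mean_absolute_error(y_pred, y_true):
    return torch.abs(y_pred - y_true).float().mean()

# It can then be passed directly, e.g. batch_metrics=[mean_absolute_error].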

The second interface is defined by the Metric class. As documented in the class, it provides methods for updating and computing the metric. This interface is compatible with TorchMetrics, a library implementing many known metrics in PyTorch. See the TorchMetrics documentation for available TorchMetrics metrics.

Note that if one implements a metric intended as both a batch and epoch metric, the methods Metric.forward() and Metric.update() need to be implemented. To avoid implementing both methods, one can implement a TorchMetrics metric at the potential cost of higher computational load as described in the TorchMetrics documentation.

class poutyne.Metric(*args, **kwargs)[source]

The abstract class representing a metric which can be accumulated at each batch and calculated at the end of the epoch.

forward(y_pred, y_true)[source]

Update the current state of the metric and return the metric for the current batch. This method has to be implemented if the metric is used as a batch metric. If used as an epoch metric, it does not need to be implemented.

Parameters:
  • y_pred – The prediction of the model.

  • y_true – Target to evaluate the model.

Returns:

The value of the metric for the current batch.

update(y_pred, y_true) → None[source]

Update the current state of the metric. This method has to be implemented if the metric is used as an epoch metric. If used as a batch metric, it does not need to be implemented.

Parameters:
  • y_pred – The prediction of the model.

  • y_true – Target to evaluate the model.

abstract compute()[source]

Compute and return the metric. This should not modify the state of the metric.

abstract reset() → None[source]

The information kept for the computation of the metric is cleaned so that a new epoch can be done.
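As an illustration of this interface, here is a sketch of a metric implementing both Metric.forward() and Metric.update(), as mentioned above, so that it can be used either as a batch or as an epoch metric. The class and attribute names below are only illustrative:

import torch
from poutyne import Metric


class RunningMAE(Metric):
    def __init__(self):
        super().__init__()
        self.reset()

    def forward(self, y_pred, y_true):
        # Update the state and return the value for the current batch.
        self.update(y_pred, y_true)
        return torch.abs(y_pred - y_true).float().mean()

    def update(self, y_pred, y_true):
        # Only accumulate the running sums used at the end of the epoch.
        self.absolute_error_sum += torch.abs(y_pred - y_true).float().sum().item()
        self.num_elements += y_true.numel()

    def compute(self):
        return self.absolute_error_sum / max(self.num_elements, 1)

    def reset(self):
        self.absolute_error_sum = 0.0
        self.num_elements = 0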

Object-Oriented API

Below are classes for predefined metrics available in Poutyne.

class poutyne.Accuracy(*, ignore_index: int = -100, reduction: str = 'mean')[source]

This metric computes the accuracy using a similar interface to CrossEntropyLoss.

Parameters:
  • ignore_index (int) – Specifies a target value that is ignored and does not contribute to the accuracy. (Default value = -100)

  • reduction (string, optional) – Specifies the reduction to apply to the output: 'none' | 'mean' | 'sum'. 'none': no reduction will be applied, 'mean': the sum of the output will be divided by the number of elements in the output, 'sum': the output will be summed.

Possible string name:
  • 'acc'

  • 'accuracy'

Keys in logs dictionary of callbacks:
  • Train: 'acc'

  • Validation: 'val_acc'

Shape:
  • Input: \((N, C)\) where C = number of classes, or \((N, C, d_1, d_2, ..., d_K)\) with \(K \geq 1\) in the case of K-dimensional accuracy.

  • Target: \((N)\) where each value is \(0 \leq \text{targets}[i] \leq C-1\), or \((N, d_1, d_2, ..., d_K)\) with \(K \geq 1\) in the case of K-dimensional accuracy.

  • Output: The accuracy.
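For example, with hand-picked values (two samples, three classes, one correct prediction), and recalling that Poutyne reports accuracies as percentages:

import torch
from poutyne import Accuracy

accuracy = Accuracy()
y_pred = torch.tensor([[2.0, 1.0, 0.1],   # predicted class: 0
                       [0.5, 0.2, 3.0]])  # predicted class: 2
y_true = torch.tensor([0, 1])
print(accuracy(y_pred, y_true))  # tensor(50.): one of the two predictions is correct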

class poutyne.BinaryAccuracy(*, threshold: float = 0.0, reduction: str = 'mean')[source]

This metric computes the accuracy using a similar interface to BCEWithLogitsLoss.

Parameters:
  • threshold (float) – The threshold for class \(1\). The default value is 0, which corresponds to a probability of sigmoid(0.) = 0.5.

  • reduction (string, optional) – Specifies the reduction to apply to the output: 'none' | 'mean' | 'sum'. 'none': no reduction will be applied, 'mean': the sum of the output will be divided by the number of elements in the output, 'sum': the output will be summed.

Possible string name:
  • 'bin_acc'

  • 'binary_acc'

  • 'binary_accuracy'

Keys in logs dictionary of callbacks:
  • Train: 'bin_acc'

  • Validation: 'val_bin_acc'

Shape:
  • Input: \((N, *)\) where \(*\) means any number of additional dimensions

  • Target: \((N, *)\), same shape as the input

  • Output: The binary accuracy.
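For example, with hand-picked logits, the default threshold of 0 corresponds to a sigmoid probability of 0.5:

import torch
from poutyne import BinaryAccuracy

binary_accuracy = BinaryAccuracy()
y_pred = torch.tensor([1.3, -0.4, 2.1, -3.0])  # logits; thresholded at 0 -> [1, 0, 1, 0]
y_true = torch.tensor([1.0, 1.0, 1.0, 0.0])
print(binary_accuracy(y_pred, y_true))  # tensor(75.): 3 of the 4 predictions are correct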

class poutyne.TopKAccuracy(k: int, *, ignore_index: int = -100, reduction: str = 'mean')[source]

This metric computes the top-k accuracy using a similar interface to CrossEntropyLoss.

Parameters:
  • k (int) – Specifies the value of k in the top-k accuracy.

  • ignore_index (int) – Specifies a target value that is ignored and does not contribute to the top-k accuracy. (Default value = -100)

  • reduction (string, optional) – Specifies the reduction to apply to the output: 'none' | 'mean' | 'sum'. 'none': no reduction will be applied, 'mean': the sum of the output will be divided by the number of elements in the output, 'sum': the output will be summed.

Possible string name:
  • 'top{k}'

  • 'top{k}_acc'

  • 'top{k}_accuracy'

for {k} from 1 to 10, 20, 30, …, 100.

Keys in logs dictionary of callbacks:
  • Train: 'top{k}'

  • Validation: 'val_top{k}'

where {k} is replaced by the value of parameter k.

Shape:
  • Input: \((N, C)\) where C = number of classes, or \((N, C, d_1, d_2, ..., d_K)\) with \(K \geq 1\) in the case of K-dimensional top-k accuracy.

  • Target: \((N)\) where each value is \(0 \leq \text{targets}[i] \leq C-1\), or \((N, d_1, d_2, ..., d_K)\) with \(K \geq 1\) in the case of K-dimensional top-k accuracy.

  • Output: The top-k accuracy.

class poutyne.FBeta(*, metric: str | None = None, average: str | int = 'macro', beta: float = 1.0, pos_label: int = 1, ignore_index: int = -100, threshold: float = 0.0, names: str | List[str] | None = None, make_deterministic: bool | None = None)[source]

The source code of this class is under the Apache v2 License and was copied from the AllenNLP project and has been modified.

Compute precision, recall, F-measure and support for each class.

The precision is the ratio tp / (tp + fp) where tp is the number of true positives and fp the number of false positives. The precision is intuitively the ability of the classifier not to label as positive a sample that is negative.

The recall is the ratio tp / (tp + fn) where tp is the number of true positives and fn the number of false negatives. The recall is intuitively the ability of the classifier to find all the positive samples.

The F-beta score can be interpreted as a weighted harmonic mean of the precision and recall, where an F-beta score reaches its best value at 1 and worst score at 0.

If we have precision and recall, the F-beta score is simply: F-beta = (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)

The F-beta score weights recall more than precision by a factor of beta. beta == 1.0 means recall and precision are equally important.

The support is the number of occurrences of each class in y_true.

Keys in logs dictionary of callbacks:
  • Train: '{metric}_{average}'

  • Validation: 'val_{metric}_{average}'

where {metric} and {average} are replaced by the value of their respective parameters.

Parameters:
  • metric (Optional[str]) – One of {‘fscore’, ‘precision’, ‘recall’}. Whether to return the F-score, the precision or the recall. When not provided, all three metrics are returned. (Default value = None)

  • average (Union[str, int]) –

    One of {'binary', 'micro', 'macro'} or an integer label number. If the argument is of type integer, the score for this class (the label number) is calculated. Otherwise, this determines the type of averaging performed on the data:

    'binary':

    Calculate metrics with regard to a single class identified by the pos_label argument. This is equivalent to average=pos_label except that the binary mode is enforced, i.e. an exception will be raised if there are more than two prediction scores.

    'micro':

    Calculate metrics globally by counting the total true positives, false negatives and false positives.

    'macro':

    Calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account.

    (Default value = ‘macro’)

  • beta (float) – The strength of recall versus precision in the F-score. (Default value = 1.0)

  • pos_label (int) – The class with respect to which the metric is computed when average == 'binary'. Otherwise, this argument has no effect. (Default value = 1)

  • ignore_index (int) – Specifies a target value that is ignored. This also works in combination with a mask if provided. (Default value = -100)

  • threshold (float) – Threshold for when there is a single score for each prediction. If a sigmoid output is used, this should be between 0 and 1; a suggested value would be 0.5. If a logits output is used, the threshold can be anywhere between -inf and inf. The suggested default value is 0 so as to give a probability of 0.5 if a sigmoid output were used. (Default = 0)

  • names (Optional[Union[str, List[str]]]) – The names associated with the metrics. It is a string when a single metric is requested. It is a list of 3 strings if all metrics are requested. (Default value = None)

  • make_deterministic (Optional[bool]) – Avoid non-deterministic operations in computations. This might make the code slower.

forward(y_pred: Tensor, y_true: Tensor | Tuple[Tensor, Tensor]) → float | Tuple[float][source]

Update the confusion matrix for calculating the F-score and compute the metrics for the current batch. See FBeta.compute() for details on the return value.

Parameters:
  • y_pred (torch.Tensor) – A tensor of predictions of shape (batch_size, num_classes, …).

  • y_true (Union[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]]) – Ground truths. A tensor of the integer class label of shape (batch_size, …). It must be the same shape as the y_pred tensor without the num_classes dimension. It can also be a tuple with two tensors of the same shape, the first being the ground truths and the second being a mask.

Returns:

A float if a single metric is set in the __init__ or a tuple of floats (f-score, precision, recall) if all metrics are requested.

update(y_pred: Tensor, y_true: Tensor | Tuple[Tensor, Tensor]) → None[source]

Update the confusion matrix for calculating the F-score.

Parameters:
  • y_pred (torch.Tensor) – A tensor of predictions of shape (batch_size, num_classes, …).

  • y_true (Union[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]]) – Ground truths. A tensor of the integer class label of shape (batch_size, …). It must be the same shape as the y_pred tensor without the num_classes dimension. It can also be a tuple with two tensors of the same shape, the first being the ground truths and the second being a mask.

compute() → float | Tuple[float][source]

Returns either a float if a single metric is set in the __init__ or a tuple of floats (f-score, precision, recall) if all metrics are requested.

reset() → None[source]

The information kept for the computation of the metric is cleaned so that a new epoch can be done.
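Here is a sketch of FBeta used as an epoch metric, reusing the network and datasets from the example at the top of this page:

from poutyne import Model, FBeta

model = Model(
    network,
    'sgd',
    'cross_entropy',
    # FBeta accumulates a confusion matrix over the batches, so it is
    # typically used as an epoch metric.
    epoch_metrics=[FBeta(metric='fscore', average='macro')],
)
model.fit_dataset(train_dataset, valid_dataset)

With these arguments, the metric is logged under the keys 'fscore_macro' and 'val_fscore_macro'.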

class poutyne.F1(**kwargs)[source]

Alias class for FBeta where metric == 'fscore' and beta == 1.

Possible string name:
  • 'f1'

Keys in logs dictionary of callbacks:
  • Train: 'fscore_{average}'

  • Validation: 'val_fscore_{average}'

where {average} is replaced by the value of the respective parameter.

class poutyne.Precision(**kwargs)[source]

Alias class for FBeta where metric == 'precision'.

Possible string name:
  • 'precision'

Keys in logs dictionary of callbacks:
  • Train: 'precision_{average}'

  • Validation: 'val_precision_{average}'

where {average} is replaced by the value of the respective parameter.

class poutyne.Recall(**kwargs)[source]

Alias class for FBeta where metric == 'recall'.

Possible string name:
  • 'recall'

Keys in logs dictionary of callbacks:
  • Train: 'recall_{average}'

  • Validation: 'val_recall_{average}'

where {average} is replaced by the value of the respective parameter.

class poutyne.BinaryF1(**kwargs)[source]

Alias class for FBeta where metric == 'fscore', average='binary' and beta == 1.

Possible string name:
  • 'binary_f1'

  • 'bin_f1'

Keys in logs dictionary of callbacks:
  • Train: 'bin_fscore'

  • Validation: 'val_bin_fscore'

class poutyne.BinaryPrecision(**kwargs)[source]

Alias class for FBeta where metric == 'precision' and average='binary'.

Possible string name:
  • 'binary_precision'

  • 'bin_precision'

Keys in logs dictionary of callbacks:
  • Train: 'bin_precision'

  • Validation: 'val_bin_precision'

class poutyne.BinaryRecall(**kwargs)[source]

Alias class for FBeta where metric == 'recall' and average='binary'.

Possible string name:
  • 'binary_recall'

  • 'bin_recall'

Keys in logs dictionary of callbacks:
  • Train: 'bin_recall'

  • Validation: 'val_bin_recall'

class poutyne.SKLearnMetrics(funcs: Callable | List[Callable], kwargs: dict | List[dict] | None = None, names: str | List[str] | None = None)[source]

Wraps metrics with a scikit-learn-like interface (metric(y_true, y_pred, sample_weight=sample_weight, **kwargs)). The SKLearnMetrics object has to keep the ground truths and predictions in memory so that it can compute the metric at the end of the epoch.

Example

from sklearn.metrics import roc_auc_score, average_precision_score
from poutyne import SKLearnMetrics
my_epoch_metric = SKLearnMetrics([roc_auc_score, average_precision_score])
Parameters:
  • funcs (Union[Callable, List[Callable]]) – A metric or a list of metrics with a scikit-learn-like interface.

  • kwargs (Optional[Union[dict, List[dict]]]) – Optional dictionary or list of dictionaries of keyword arguments to pass to each corresponding metric. (Default value = None)

  • names (Optional[Union[str, List[str]]]) – Optional string or list of strings corresponding to the names given to the metrics. By default, the names are the names of the functions.

forward(y_pred: Tensor, y_true: Tensor | Tuple[Tensor, Tensor]) → None[source]

Accumulate the predictions, ground truths and sample weights if any, and compute the metric for the current batch.

Parameters:
  • y_pred (torch.Tensor) – A tensor of predictions of the shape expected by the metric functions passed to the class.

  • y_true (Union[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]]) – Ground truths. A tensor of ground truths of the shape expected by the metric functions passed to the class. It can also be a tuple with two tensors, the first being the ground truths and the second corresponding to the sample_weight argument passed to the metric functions in Scikit-Learn.

update(y_pred: Tensor, y_true: Tensor | Tuple[Tensor, Tensor]) → None[source]

Accumulate the predictions, ground truths and sample weights if any.

Parameters:
  • y_pred (torch.Tensor) – A tensor of predictions of the shape expected by the metric functions passed to the class.

  • y_true (Union[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]]) – Ground truths. A tensor of ground truths of the shape expected by the metric functions passed to the class. It can also be a tuple with two tensors, the first being the ground truths and the second corresponding to the sample_weight argument passed to the metric functions in Scikit-Learn.

compute() → Dict[source]

Returns the metrics as a dictionary with the names as keys.

reset() → None[source]

The information kept for the computation of the metric is cleaned so that a new epoch can be done.
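Here is a sketch showing the kwargs and names arguments; the keyword argument below is that of scikit-learn's roc_auc_score:

from sklearn.metrics import roc_auc_score
from poutyne import SKLearnMetrics

# The ROC AUC is computed with the one-vs-rest multiclass strategy and
# logged under the name 'auroc' instead of 'roc_auc_score'.
my_epoch_metric = SKLearnMetrics(roc_auc_score, kwargs={'multi_class': 'ovr'}, names='auroc')

The resulting object is then passed via the epoch_metrics argument as usual.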

Functional

Below is a functional version of some of the classes in the Object-Oriented API section.

poutyne.acc(y_pred, y_true, *, ignore_index=-100, reduction='mean')[source]

Computes the accuracy.

This is a functional version of Accuracy.

See Accuracy for details.

poutyne.bin_acc(y_pred, y_true, *, threshold=0.0, reduction='mean')[source]

Computes the binary accuracy.

This is a functional version of BinaryAccuracy.

See BinaryAccuracy for details.

poutyne.topk(y_pred, y_true, k, *, ignore_index=-100, reduction='mean')[source]

Computes the top-k accuracy.

This is a functional version of TopKAccuracy.

See TopKAccuracy for details.
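For example, a quick sanity check of topk() with hand-picked values (the accuracy is reported as a percentage):

import torch
from poutyne import topk

y_pred = torch.tensor([[2.0, 1.0, 0.1],
                       [0.5, 0.2, 3.0]])
y_true = torch.tensor([1, 1])
print(topk(y_pred, y_true, k=2))  # tensor(50.): the first target is in the top 2, the second is not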

Computing Multiple Metrics at Once

When passing the metrics to Model and ModelBundle.from_network(), each metric name can be changed by passing a tuple (name, metric) instead of simply the metric function or object, where name is the alternative name of the metric.

A metric can return multiple values (e.g. an F1-score together with the associated precision and recall). The values can be returned via an iterable (tuple, list, NumPy array, tensor, etc.) or via a mapping (e.g. a dict). However, in this case, the names of the different metrics have to be passed in some way.

There are two ways to do so. The easiest one is to pass the metric as a tuple (names, metric) where names is a tuple containing a name for each returned metric. Another way is to set the attribute __name__ of the function or object to a tuple containing a name for each returned metric. Note that, when the metric returns a mapping, the names of the different metrics must be keys in the mapping.

Examples:

import torch
from poutyne import Metric
from torchmetrics import F1Score, Precision, Recall, MetricCollection


my_custom_metric = lambda input, target: 42.0
my_custom_metric2 = lambda input, target: torch.tensor([42.0, 43.0])
my_custom_metric3 = lambda input, target: {'a': 42.0, 'b': 43.0}


class CustomMetric(Metric):
    def forward(self, y_pred, y_true):
        return self.compute()

    def update(self, y_pred, y_true):
        pass

    def compute(self):
        return torch.tensor([42.0, 43.0])

    def reset(self):
        pass


class CustomMetric2(Metric):
    def forward(self, y_pred, y_true):
        return self.compute()

    def update(self, y_pred, y_true):
        pass

    def compute(self):
        return {'c': 42.0, 'd': 43.0}

    def reset(self):
        pass


class CustomMetric3(Metric):
    def __init__(self):
        super().__init__()
        self.__name__ = ['e', 'f']

    def forward(self, y_pred, y_true):
        return self.compute()

    def update(self, y_pred, y_true):
        pass

    def compute(self):
        return torch.tensor([42.0, 43.0])

    def reset(self):
        pass


metric_collection = MetricCollection(
    [
        F1Score(num_classes=10, average="macro", task="multiclass"),
        Precision(num_classes=10, average="macro", task="multiclass"),
        Recall(num_classes=10, average="macro", task="multiclass"),
    ]
)

metrics = [
    ("custom_name", my_custom_metric),
    (("metric_1", "metric_2"), my_custom_metric2),
    (("a", "b"), my_custom_metric3),
    (("metric_3", "metric_4"), CustomMetric()),
    (("c", "d"), CustomMetric2()),

    # No need to pass the names since the class sets the attribute __name__.
    CustomMetric3(),

    # The names must be the keys returned by MetricCollection. With task="multiclass",
    # the TorchMetrics classes above resolve to their Multiclass* counterparts.
    (("MulticlassF1Score", "MulticlassPrecision", "MulticlassRecall"), metric_collection),
]