Metrics
Poutyne offers two kinds of metrics: batch and epoch metrics.
Batch metrics are computed on each batch, whereas epoch metrics accumulate statistics over the batches and compute the metric once at the end of the epoch.
Batch metrics are passed to Model and ModelBundle.from_network() using the batch_metrics argument.
Epoch metrics are passed to Model and ModelBundle.from_network() using the epoch_metrics argument.
In addition to the predefined metrics below, every PyTorch loss function can be used as a metric by passing its functional name as a string.
The key in callback logs associated with each one is its name without the _loss suffix. For example, the loss function mse_loss() can be passed as a metric with the name 'mse_loss' or simply 'mse', and the keys will be 'mse' and 'val_mse' for the training and validation MSE, respectively.
Note that PyTorch loss functions can also be passed as the loss function of Model in the same way.
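The naming rule above can be sketched in a few lines. The helper `log_keys` is hypothetical and only mirrors the rule as described here; it is not part of Poutyne.

```python
from typing import Tuple


def log_keys(metric_name: str) -> Tuple[str, str]:
    """Return the (training, validation) log keys for a loss used as a metric.

    Hypothetical helper: it mirrors the documented naming rule only.
    """
    suffix = "_loss"
    # Drop the "_loss" suffix when present; 'mse_loss' and 'mse' give the same key.
    base = metric_name[: -len(suffix)] if metric_name.endswith(suffix) else metric_name
    return base, f"val_{base}"


print(log_keys("mse_loss"))  # ('mse', 'val_mse')
print(log_keys("mse"))       # ('mse', 'val_mse')
```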
Warning
When using the batch_metrics argument, the metrics are computed for each batch.
This can significantly slow down the computations depending on the metrics used.
This mostly happens with non-decomposable metrics such as torchmetrics.AUROC, where an ordering of the elements is necessary to compute the metric.
In such a case, we advise using them as epoch metrics instead.
Here is an example using metrics:
from poutyne import Model, Accuracy, F1
import torchmetrics

# `network`, `train_dataset` and `valid_dataset` are assumed to be defined elsewhere.
model = Model(
    network,
    'sgd',
    'cross_entropy',
    batch_metrics=[Accuracy(), F1()],
    # Can also use a string in this case:
    # batch_metrics=['accuracy', 'f1'],
    epoch_metrics=[torchmetrics.AUROC(num_classes=10, task="multiclass")],
)
model.fit_dataset(train_dataset, valid_dataset)
Interface
There are two interfaces available for metrics.
The first interface is the same as PyTorch loss functions: metric(y_pred, y_true).
When using that interface, the metric is assumed to be decomposable and is averaged for the whole epoch.
The batch size is inferred with poutyne.get_batch_size() using y_pred and y_true as values.
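To see why decomposability and the batch size matter for the first interface, here is a minimal plain-Python sketch. It assumes the epoch value is the average of the batch values weighted by batch size; the function names are illustrative and plain lists stand in for tensors.

```python
def batch_accuracy(y_pred, y_true):
    """Fraction of correct predictions in one batch (labels are plain ints here)."""
    correct = sum(int(p == t) for p, t in zip(y_pred, y_true))
    return correct / len(y_true)


def epoch_average(metric, batches):
    """Average a per-batch metric over an epoch, weighting each batch by its size."""
    total, count = 0.0, 0
    for y_pred, y_true in batches:
        batch_size = len(y_true)
        total += metric(y_pred, y_true) * batch_size
        count += batch_size
    return total / count


batches = [
    ([0, 1, 1, 0], [0, 1, 0, 0]),  # 3 correct out of 4
    ([1, 1], [0, 1]),              # 1 correct out of 2
]
epoch_value = epoch_average(batch_accuracy, batches)
print(epoch_value)  # equals 4 correct out of 6 examples
```

For a decomposable metric such as accuracy, this weighted average matches the value computed over all examples at once. A metric that depends on a global ordering of the examples does not have this property.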
The second interface is defined by the Metric class.
As documented in the class, it provides methods for updating and computing the metric.
This interface is compatible with TorchMetrics, a library implementing many known metrics in PyTorch.
See the TorchMetrics documentation for available TorchMetrics metrics.
Note that if one implements a metric intended as both a batch and epoch metric, the methods Metric.forward() and Metric.update() need to be implemented.
To avoid implementing both methods, one can implement a TorchMetrics metric at the potential cost of higher computational load as described in the TorchMetrics documentation.
- class poutyne.Metric(*args, **kwargs)[source]
The abstract class representing a metric which can be accumulated at each batch and calculated at the end of the epoch.
- forward(y_pred, y_true)[source]
Update the current state of the metric and return the metric for the current batch. This method has to be implemented if the metric is used as a batch metric. If used as an epoch metric, it does not need to be implemented.
- Parameters:
y_pred – The prediction of the model.
y_true – Target to evaluate the model.
- Returns:
The value of the metric for the current batch.
- update(y_pred, y_true) → None [source]
Update the current state of the metric. This method has to be implemented if the metric is used as an epoch metric. If used as a batch metric, it does not need to be implemented.
- Parameters:
y_pred – The prediction of the model.
y_true – Target to evaluate the model.
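The division of labour between forward(), update(), compute() and reset() can be sketched with a dependency-free stand-in. This is an assumption-laden illustration: a real metric would subclass poutyne.Metric and operate on tensors, and the class name below is made up.

```python
class RunningMeanAbsoluteError:
    """Stand-in with the same method names as the Metric interface.

    Assumption: a real implementation would subclass poutyne.Metric and
    work on tensors; plain lists keep this sketch dependency-free.
    """

    def __init__(self):
        self.reset()

    def reset(self):
        # Clear the accumulated state, e.g. before a new epoch.
        self.abs_error_sum = 0.0
        self.count = 0

    def update(self, y_pred, y_true):
        # Epoch-metric path: accumulate state only, return nothing.
        self.abs_error_sum += sum(abs(p - t) for p, t in zip(y_pred, y_true))
        self.count += len(y_true)

    def forward(self, y_pred, y_true):
        # Batch-metric path: accumulate state and return the batch value.
        self.update(y_pred, y_true)
        return sum(abs(p - t) for p, t in zip(y_pred, y_true)) / len(y_true)

    def compute(self):
        # Value over everything accumulated since the last reset.
        return self.abs_error_sum / self.count


metric = RunningMeanAbsoluteError()
batch_value = metric.forward([1.0, 2.0], [1.0, 4.0])  # value for this batch only
metric.update([0.0, 0.0], [3.0, 1.0])                 # state only, no return value
epoch_value = metric.compute()                        # value over both batches
```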
Object-Oriented API
Below are classes for predefined metrics available in Poutyne.
- class poutyne.Accuracy(*, ignore_index: int = -100, reduction: str = 'mean')[source]
This metric computes the accuracy using a similar interface to CrossEntropyLoss.
- Parameters:
ignore_index (int) – Specifies a target value that is ignored and does not contribute to the accuracy. (Default value = -100)
reduction (string, optional) – Specifies the reduction to apply to the output: 'none' | 'mean' | 'sum'. 'none': no reduction will be applied, 'mean': the sum of the output will be divided by the number of elements in the output, 'sum': the output will be summed.
- Possible string name: 'acc', 'accuracy'
- Keys in logs dictionary of callbacks:
Train: 'acc'
Validation: 'val_acc'
- Shape:
Input: \((N, C)\) where C = number of classes, or \((N, C, d_1, d_2, ..., d_K)\) with \(K \geq 1\) in the case of K-dimensional accuracy.
Target: \((N)\) where each value is \(0 \leq \text{targets}[i] \leq C-1\), or \((N, d_1, d_2, ..., d_K)\) with \(K \geq 1\) in the case of K-dimensional accuracy.
Output: The accuracy.
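As a rough illustration of how ignore_index affects the result, the following plain-Python sketch computes a mean accuracy over \((N, C)\) scores. It is not Poutyne's implementation: lists stand in for tensors, the value is returned as a fraction, and the handling of a batch where every target is ignored is a choice made for this sketch.

```python
def accuracy(scores, targets, ignore_index=-100):
    """Mean accuracy over the targets that differ from ignore_index.

    scores: list of N per-class score lists, shape (N, C).
    targets: list of N integer class labels.
    """
    kept = [(row, t) for row, t in zip(scores, targets) if t != ignore_index]
    if not kept:
        return 0.0  # choice made for this sketch when every target is ignored
    # The predicted class is the index of the largest score in each row.
    correct = sum(
        int(max(range(len(row)), key=row.__getitem__) == t) for row, t in kept
    )
    return correct / len(kept)


scores = [[2.0, 0.1, 0.3], [0.2, 0.9, 0.1], [0.5, 0.4, 1.5], [1.0, 0.0, 0.0]]
targets = [0, 2, 2, -100]  # the last example is ignored
value = accuracy(scores, targets)
print(value)  # 2 correct out of the 3 examples that are kept
```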
- class poutyne.BinaryAccuracy(*, threshold: float = 0.0, reduction: str = 'mean')[source]
This metric computes the accuracy using a similar interface to BCEWithLogitsLoss.
- Parameters:
threshold (float) – The threshold for class \(1\). Default value is 0., that is a probability of sigmoid(0.) = 0.5.
reduction (string, optional) – Specifies the reduction to apply to the output: 'none' | 'mean' | 'sum'. 'none': no reduction will be applied, 'mean': the sum of the output will be divided by the number of elements in the output, 'sum': the output will be summed.
- Possible string name: 'bin_acc', 'binary_acc', 'binary_accuracy'
- Keys in logs dictionary of callbacks:
Train: 'bin_acc'
Validation: 'val_bin_acc'
- Shape:
Input: \((N, *)\) where \(*\) means any number of additional dimensions
Target: \((N, *)\), same shape as the input
Output: The binary accuracy.
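The relation between the threshold on logits and the 0.5 probability cut-off can be checked with a small plain-Python sketch. It is illustrative only: lists stand in for tensors, and using a strict comparison at the threshold is an assumption of this sketch.

```python
import math


def sigmoid(x):
    """Logistic function mapping a logit to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def binary_accuracy(logits, targets, threshold=0.0):
    """Fraction of logits whose thresholded prediction matches the 0/1 target."""
    # Assumption of this sketch: class 1 is predicted when logit > threshold.
    predictions = [1 if z > threshold else 0 for z in logits]
    return sum(int(p == t) for p, t in zip(predictions, targets)) / len(targets)


logits = [-1.2, 0.3, 2.0, -0.1]
targets = [0, 1, 0, 1]
value = binary_accuracy(logits, targets)
print(sigmoid(0.0))  # 0.5, so a logit threshold of 0 is a probability cut-off of 0.5
print(value)         # 2 correct out of 4
```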
- class poutyne.TopKAccuracy(k: int, *, ignore_index: int = -100, reduction: str = 'mean')[source]
This metric computes the top-k accuracy using a similar interface to CrossEntropyLoss.
- Parameters:
k (int) – Specifies the value of k in the top-k accuracy.
ignore_index (int) – Specifies a target value that is ignored and does not contribute to the top-k accuracy. (Default value = -100)
reduction (string, optional) – Specifies the reduction to apply to the output: 'none' | 'mean' | 'sum'. 'none': no reduction will be applied, 'mean': the sum of the output will be divided by the number of elements in the output, 'sum': the output will be summed.
- Possible string name: 'top{k}', 'top{k}_acc', 'top{k}_accuracy' for {k} from 1 to 10, 20, 30, …, 100.
- Keys in logs dictionary of callbacks:
Train: 'top{k}'
Validation: 'val_top{k}'
where {k} is replaced by the value of parameter k.
- Shape:
Input: \((N, C)\) where C = number of classes, or \((N, C, d_1, d_2, ..., d_K)\) with \(K \geq 1\) in the case of K-dimensional top-k accuracy.
Target: \((N)\) where each value is \(0 \leq \text{targets}[i] \leq C-1\), or \((N, d_1, d_2, ..., d_K)\) with \(K \geq 1\) in the case of K-dimensional top-k accuracy.
Output: The top-k accuracy.
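A prediction counts as correct for top-k accuracy when the target is among the k highest-scoring classes. The plain-Python sketch below illustrates that definition; it is not Poutyne's implementation, and it returns a fraction computed on lists rather than tensors.

```python
def topk_accuracy(scores, targets, k, ignore_index=-100):
    """Fraction of kept examples whose target is among the k largest scores."""
    kept = [(row, t) for row, t in zip(scores, targets) if t != ignore_index]
    if not kept:
        return 0.0  # choice made for this sketch when every target is ignored
    hits = 0
    for row, target in kept:
        # Class indices sorted by decreasing score, truncated to the first k.
        top = sorted(range(len(row)), key=row.__getitem__, reverse=True)[:k]
        hits += int(target in top)
    return hits / len(kept)


scores = [[0.1, 0.7, 0.2], [0.5, 0.3, 0.2], [0.2, 0.3, 0.5]]
targets = [2, 1, 0]
top1 = topk_accuracy(scores, targets, k=1)
top2 = topk_accuracy(scores, targets, k=2)
print(top1, top2)
```

With these made-up scores, no example is correct at k=1 while two of the three are correct at k=2.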
- class poutyne.FBeta(*, metric: str | None = None, average: str | int = 'macro', beta: float = 1.0, pos_label: int = 1, ignore_index: int = -100, threshold: float = 0.0, names: str | List[str] | None = None)[source]
The source code of this class is under the Apache v2 License and was copied from the AllenNLP project and has been modified.
Compute precision, recall, F-measure and support for each class.
The precision is the ratio tp / (tp + fp) where tp is the number of true positives and fp the number of false positives. The precision is intuitively the ability of the classifier not to label as positive a sample that is negative.
The recall is the ratio tp / (tp + fn) where tp is the number of true positives and fn the number of false negatives. The recall is intuitively the ability of the classifier to find all the positive samples.
The F-beta score can be interpreted as a weighted harmonic mean of the precision and recall, where an F-beta score reaches its best value at 1 and worst score at 0.
If we have precision and recall, the F-beta score is simply:
F-beta = (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)
The F-beta score weights recall more than precision by a factor of beta. beta == 1.0 means recall and precision are equally important.
The support is the number of occurrences of each class in y_true.
- Keys in logs dictionary of callbacks:
Train: '{metric}_{average}'
Validation: 'val_{metric}_{average}'
where {metric} and {average} are replaced by the value of their respective parameters.
- Parameters:
metric (Optional[str]) – One of {‘fscore’, ‘precision’, ‘recall’}. Whether to return the F-score, the precision or the recall. When not provided, all three metrics are returned. (Default value = None)
average (Union[str, int]) – One of {‘binary’, ‘micro’, ‘macro’, label_number}. If the argument is of type integer, the score for this class (the label number) is calculated. Otherwise, this determines the type of averaging performed on the data:
'binary': Calculate metrics with regard to a single class identified by the pos_label argument. This is equivalent to average=pos_label except that the binary mode is enforced, i.e. an exception will be raised if there are more than two prediction scores.
'micro': Calculate metrics globally by counting the total true positives, false negatives and false positives.
'macro': Calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account.
(Default value = ‘macro’)
beta (float) – The strength of recall versus precision in the F-score. (Default value = 1.0)
pos_label (int) – The class with respect to which the metric is computed when average == 'binary'. Otherwise, this argument has no effect. (Default value = 1)
ignore_index (int) – Specifies a target value that is ignored. This also works in combination with a mask if provided. (Default value = -100)
threshold (float) – Threshold for when there is a single score for each prediction. If a sigmoid output is used, this should be between 0 and 1. A suggested value would be 0.5. If a logits output is used, the threshold would be between -inf and inf. The suggested default value is 0 so as to give a probability of 0.5 if a sigmoid output were used. (Default = 0)
names (Optional[Union[str, List[str]]]) – The names associated to the metrics. It is a string when a single metric is requested. It is a list of 3 strings if all metrics are requested. (Default value = None)
- forward(y_pred: Tensor, y_true: Tensor | Tuple[Tensor, Tensor]) → float | Tuple[float] [source]
Update the confusion matrix for calculating the F-score and compute the metrics for the current batch. See FBeta.compute() for details on the return value.
- Parameters:
y_pred (torch.Tensor) – A tensor of predictions of shape (batch_size, num_classes, …).
y_true (Union[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]]) – Ground truths. A tensor of the integer class label of shape (batch_size, …). It must be the same shape as the y_pred tensor without the num_classes dimension. It can also be a tuple with two tensors of the same shape, the first being the ground truths and the second being a mask.
- Returns:
A float if a single metric is set in the __init__ or a tuple of floats (f-score, precision, recall) if all metrics are requested.
- update(y_pred: Tensor, y_true: Tensor | Tuple[Tensor, Tensor]) → None [source]
Update the confusion matrix for calculating the F-score.
- Parameters:
y_pred (torch.Tensor) – A tensor of predictions of shape (batch_size, num_classes, …).
y_true (Union[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]]) – Ground truths. A tensor of the integer class label of shape (batch_size, …). It must be the same shape as the y_pred tensor without the num_classes dimension. It can also be a tuple with two tensors of the same shape, the first being the ground truths and the second being a mask.
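The F-beta formula and the difference between 'micro' and 'macro' averaging can be worked through with a short plain-Python sketch. The per-class counts are made up, the helper names are illustrative, and returning 0 when a ratio is undefined is a choice of this sketch rather than a statement about FBeta.

```python
def safe_div(numerator, denominator):
    """Division that returns 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def precision_recall(tp, fp, fn):
    """precision = tp / (tp + fp), recall = tp / (tp + fn)."""
    return safe_div(tp, tp + fp), safe_div(tp, tp + fn)


def fbeta(precision, recall, beta=1.0):
    """F-beta = (1 + beta**2) * P * R / (beta**2 * P + R)."""
    return safe_div((1 + beta ** 2) * precision * recall, beta ** 2 * precision + recall)


# Made-up per-class counts: (true positives, false positives, false negatives).
counts = {0: (8, 3, 2), 1: (1, 2, 3)}

# 'macro': one score per class, then the unweighted mean of those scores.
per_class_f1 = [fbeta(*precision_recall(*c)) for c in counts.values()]
macro_f1 = sum(per_class_f1) / len(per_class_f1)

# 'micro': pool the counts over all classes, then compute a single score.
tp, fp, fn = (sum(c[i] for c in counts.values()) for i in range(3))
micro_f1 = fbeta(*precision_recall(tp, fp, fn))

# With beta > 1, high recall is rewarded more than high precision.
recall_heavy = fbeta(precision=0.5, recall=1.0, beta=2.0)
precision_heavy = fbeta(precision=1.0, recall=0.5, beta=2.0)
print(macro_f1, micro_f1, recall_heavy, precision_heavy)
```

The macro score is pulled down by the poorly predicted minority class, while the micro score is dominated by the frequent class, which is the label-imbalance effect mentioned for 'macro' above.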
- class poutyne.F1(**kwargs)[source]
Alias class for FBeta where metric == 'fscore' and beta == 1.
- Possible string name: 'f1'
- Keys in logs dictionary of callbacks:
Train: 'fscore_{average}'
Validation: 'val_fscore_{average}'
where {average} is replaced by the value of the respective parameter.
- class poutyne.Precision(**kwargs)[source]
Alias class for FBeta where metric == 'precision'.
- Possible string name: 'precision'
- Keys in logs dictionary of callbacks:
Train: 'precision_{average}'
Validation: 'val_precision_{average}'
where {average} is replaced by the value of the respective parameter.
- class poutyne.Recall(**kwargs)[source]
Alias class for FBeta where metric == 'recall'.
- Possible string name: 'recall'
- Keys in logs dictionary of callbacks:
Train: 'recall_{average}'
Validation: 'val_recall_{average}'
where {average} is replaced by the value of the respective parameter.
- class poutyne.BinaryF1(**kwargs)[source]
Alias class for FBeta where metric == 'fscore', average == 'binary' and beta == 1.
- Possible string name: 'binary_f1', 'bin_f1'
- Keys in logs dictionary of callbacks:
Train: 'bin_fscore'
Validation: 'val_bin_fscore'
- class poutyne.BinaryPrecision(**kwargs)[source]
Alias class for FBeta where metric == 'precision' and average == 'binary'.
- Possible string name: 'binary_precision', 'bin_precision'
- Keys in logs dictionary of callbacks:
Train: 'bin_precision'
Validation: 'val_bin_precision'
- class poutyne.BinaryRecall(**kwargs)[source]
Alias class for FBeta where metric == 'recall' and average == 'binary'.
- Possible string name: 'binary_recall', 'bin_recall'
- Keys in logs dictionary of callbacks:
Train: 'bin_recall'
Validation: 'val_bin_recall'
- class poutyne.SKLearnMetrics(funcs: Callable | List[Callable], kwargs: dict | List[dict] | None = None, names: str | List[str] | None = None)[source]
Wrap metrics with Scikit-learn-like interface (metric(y_true, y_pred, sample_weight=sample_weight, **kwargs)). The SKLearnMetrics object has to keep in memory the ground truths and predictions so that it can compute the metric at the end.
Example
from sklearn.metrics import roc_auc_score, average_precision_score
from poutyne import SKLearnMetrics

my_epoch_metric = SKLearnMetrics([roc_auc_score, average_precision_score])
- Parameters:
funcs (Union[Callable, List[Callable]]) – A metric or a list of metrics with a scikit-learn-like interface.
kwargs (Optional[Union[dict, List[dict]]]) – Optional dictionary or list of dictionaries corresponding to keyword arguments to pass to each corresponding metric. (Default value = None)
names (Optional[Union[str, List[str]]]) – Optional string or list of strings corresponding to the names given to the metrics. By default, the names are the names of the functions.
- forward(y_pred: Tensor, y_true: Tensor | Tuple[Tensor, Tensor]) → None [source]
Accumulate the predictions, ground truths and sample weights if any, and compute the metric for the current batch.
- Parameters:
y_pred (torch.Tensor) – A tensor of predictions of the shape expected by the metric functions passed to the class.
y_true (Union[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]]) – Ground truths. A tensor of ground truths of the shape expected by the metric functions passed to the class. It can also be a tuple with two tensors, the first being the ground truths and the second corresponding to the sample_weight argument passed to the metric functions in Scikit-Learn.
- update(y_pred: Tensor, y_true: Tensor | Tuple[Tensor, Tensor]) → None [source]
Accumulate the predictions, ground truths and sample weights if any.
- Parameters:
y_pred (torch.Tensor) – A tensor of predictions of the shape expected by the metric functions passed to the class.
y_true (Union[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]]) – Ground truths. A tensor of ground truths of the shape expected by the metric functions passed to the class. It can also be a tuple with two tensors, the first being the ground truths and the second corresponding to the sample_weight argument passed to the metric functions in Scikit-Learn.
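The accumulate-then-compute pattern described for SKLearnMetrics can be pictured with a dependency-free stand-in. Everything below is an assumption made for illustration: the class is not Poutyne's SKLearnMetrics, `weighted_accuracy` is a hypothetical function with a scikit-learn-like signature, and lists stand in for tensors.

```python
class AccumulatingMetric:
    """Stand-in illustrating the accumulate-then-compute pattern.

    Assumption: this is not Poutyne's SKLearnMetrics, only a plain-Python
    sketch of the behaviour described above.
    """

    def __init__(self, func, **kwargs):
        self.func = func
        self.kwargs = kwargs
        self.reset()

    def reset(self):
        self.y_pred, self.y_true, self.sample_weight = [], [], []

    def update(self, y_pred, y_true):
        # y_true may be a (ground_truths, sample_weight) tuple.
        if isinstance(y_true, tuple):
            y_true, weights = y_true
            self.sample_weight.extend(weights)
        self.y_pred.extend(y_pred)
        self.y_true.extend(y_true)

    def compute(self):
        weights = self.sample_weight or None
        # Scikit-learn convention: ground truths first, predictions second.
        return self.func(self.y_true, self.y_pred, sample_weight=weights, **self.kwargs)


def weighted_accuracy(y_true, y_pred, sample_weight=None):
    """Hypothetical metric with a scikit-learn-like signature."""
    weights = sample_weight or [1.0] * len(y_true)
    hits = sum(w for t, p, w in zip(y_true, y_pred, weights) if t == p)
    return hits / sum(weights)


metric = AccumulatingMetric(weighted_accuracy)
metric.update([1, 0], ([1, 1], [1.0, 3.0]))  # batch 1 with sample weights
metric.update([0, 1], ([0, 1], [1.0, 1.0]))  # batch 2 with sample weights
value = metric.compute()                     # computed once, over both batches
```

Because every prediction and ground truth is kept until compute() is called, memory use grows with the number of examples seen in the epoch.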
Functional
Below are functional versions of some of the classes in the Object-Oriented API section.
- poutyne.acc(y_pred, y_true, *, ignore_index=-100, reduction='mean')[source]
Computes the accuracy.
This is a functional version of Accuracy. See Accuracy for details.
- poutyne.bin_acc(y_pred, y_true, *, threshold=0.0, reduction='mean')[source]
Computes the binary accuracy.
This is a functional version of BinaryAccuracy. See BinaryAccuracy for details.
- poutyne.topk(y_pred, y_true, k, *, ignore_index=-100, reduction='mean')[source]
Computes the top-k accuracy.
This is a functional version of TopKAccuracy. See TopKAccuracy for details.
Computing Multiple Metrics at Once
When passing the metrics to Model and ModelBundle.from_network(), each metric name can be changed by passing a tuple (name, metric) instead of simply the metric function or object, where name is the alternative name of the metric.
Metrics can return multiple metrics (e.g. a metric could return an F1-score with the associated precision and recall). The metrics can be returned via an iterable (tuple, list, NumPy arrays, tensors, etc.) or via a mapping (e.g. a dict). However, in this case, the names of the different metrics have to be passed in some way.
There are two ways to do so.
The easiest one is to pass the metric as a tuple (names, metric) where names is a tuple containing a name for each metric returned.
Another way is to override the attribute __name__ of the function or object so that it returns a tuple containing a name for all metrics returned.
Note that, when the metric returns a mapping, the names of the different metrics must be keys in the mapping.
Examples:
import torch
from poutyne import Metric
from torchmetrics import F1Score, Precision, Recall, MetricCollection

# Function metrics returning a single value, an iterable and a mapping.
my_custom_metric = lambda input, target: 42.0
my_custom_metric2 = lambda input, target: torch.tensor([42.0, 43.0])
my_custom_metric3 = lambda input, target: {'a': 42.0, 'b': 43.0}


class CustomMetric(Metric):
    def forward(self, y_pred, y_true):
        return self.compute()

    def update(self, y_pred, y_true):
        pass

    def compute(self):
        return torch.tensor([42.0, 43.0])

    def reset(self):
        pass


class CustomMetric2(Metric):
    def forward(self, y_pred, y_true):
        return self.compute()

    def update(self, y_pred, y_true):
        pass

    def compute(self):
        return {'c': 42.0, 'd': 43.0}

    def reset(self):
        pass


class CustomMetric3(Metric):
    def __init__(self):
        super().__init__()
        self.__name__ = ['e', 'f']

    def forward(self, y_pred, y_true):
        return self.compute()

    def update(self, y_pred, y_true):
        pass

    def compute(self):
        return torch.tensor([42.0, 43.0])

    def reset(self):
        pass


metric_collection = MetricCollection(
    [
        F1Score(num_classes=10, average="macro", task="multiclass"),
        Precision(num_classes=10, average="macro", task="multiclass"),
        Recall(num_classes=10, average="macro", task="multiclass"),
    ]
)

metrics = [
    ("custom_name", my_custom_metric),
    (("metric_1", "metric_2"), my_custom_metric2),
    (("a", "b"), my_custom_metric3),
    (("metric_3", "metric_4"), CustomMetric()),
    (("c", "d"), CustomMetric2()),
    # No need to pass the names since the class sets the attribute __name__.
    CustomMetric3(),
    # The names are the keys returned by MetricCollection.
    (("F1Score", "Precision", "Recall"), metric_collection),
]
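How the names end up paired with the returned values can be pictured with a small helper. `name_values` is hypothetical and only mirrors the rules stated in this section (a single name for a single value, one name per element of an iterable, names used as keys of a mapping); it is not Poutyne's code.

```python
from collections.abc import Mapping


def name_values(names, values):
    """Pair metric names with the value(s) returned by a metric.

    Hypothetical helper mirroring the naming rules of this section.
    """
    if isinstance(names, str):
        # A single name for a metric returning a single value.
        return {names: values}
    if isinstance(values, Mapping):
        # The names must be keys of the returned mapping.
        return {name: values[name] for name in names}
    values = list(values)
    if len(values) != len(names):
        raise ValueError("one name is needed per returned value")
    return dict(zip(names, values))


single = name_values("custom_name", 42.0)
from_iterable = name_values(("metric_1", "metric_2"), [42.0, 43.0])
from_mapping = name_values(("a", "b"), {"a": 42.0, "b": 43.0})
print(single, from_iterable, from_mapping)
```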