labelling losses, fitting classifiers etc


Precision, recall and F-scores all work for multi-label classification, although they behave badly when the classes are unbalanced.

Unbalanced class problems

Metric Zoo

Matthews correlation coefficient

Due to Matthews (Matt75). This is more or less the ultimate measure, since its behaviour is reasonable for 2-class or multiclass problems, balanced or unbalanced. Unless your classes differ greatly in importance, this is a good default.

However, it is computed from hard class assignments, so it is not differentiable with respect to classification certainties, and you can’t use it directly as, e.g., a training target in neural nets.

2-class case

Take your \(2 \times 2\) confusion matrix of true positives, false positives etc.

\begin{equation*} {\text{MCC}}={\frac {TP\times TN-FP\times FN}{{\sqrt {(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}}} \end{equation*}
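It is related to the \(\chi^2\) statistic of the \(2 \times 2\) contingency table on \(n\) observations by

\begin{equation*} |{\text{MCC}}|={\sqrt {{\frac {\chi ^{2}}{n}}}} \end{equation*}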
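
As a minimal sketch of the 2-class formula (the function names are mine, and returning 0 when a marginal is empty is a convention I am assuming, not something the formula dictates), here is the MCC next to plain accuracy on a badly unbalanced problem:

```python
import math


def mcc_binary(tp, fp, fn, tn):
    """Matthews correlation coefficient from the cells of a 2x2 confusion matrix."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: if any row or column of the matrix is empty,
    # the ratio is 0/0, and we report 0 (no better than chance).
    return numerator / denominator if denominator else 0.0


def accuracy(tp, fp, fn, tn):
    """Fraction of correct predictions, for comparison."""
    return (tp + tn) / (tp + fp + fn + tn)


# 5 positives, 95 negatives, and a classifier that always says "negative".
cells = dict(tp=0, fp=0, fn=5, tn=95)
print(accuracy(**cells))    # 0.95
print(mcc_binary(**cells))  # 0.0
```

Accuracy rewards the degenerate classifier; the MCC does not.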

Multiclass case

Take your \(K \times K\) confusion matrix \(C\), then

\begin{equation*} {\displaystyle {\text{MCC}}={\frac {\sum _{k}\sum _{l}\sum _{m}C_{kk}C_{lm}-C_{kl}C_{mk}}{{\sqrt {\sum _{k}(\sum _{l}C_{kl})(\sum _{k'|k'\neq k}\sum _{l'}C_{k'l'})}}{\sqrt {\sum _{k}(\sum _{l}C_{lk})(\sum _{k'|k'\neq k}\sum _{l'}C_{l'k'})}}}}} \end{equation*}
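
The triple sum looks worse than it is. Summing out the free indices, the numerator is \(n\operatorname{tr}(C) - \sum_k r_k c_k\) and the two denominator factors are \(\sum_k r_k(n - r_k)\) and \(\sum_k c_k(n - c_k)\), where \(r_k\) and \(c_k\) are the row and column sums and \(n\) is the total count. A sketch under those definitions (the names are hypothetical, and the zero-denominator convention is again my assumption):

```python
import math


def mcc_multiclass(C):
    """Multiclass MCC from a square confusion matrix given as a list of lists."""
    K = len(C)
    n = sum(sum(row) for row in C)                      # total count
    row_sums = [sum(C[k]) for k in range(K)]            # r_k = sum_l C[k][l]
    col_sums = [sum(C[l][k] for l in range(K)) for k in range(K)]  # c_k
    trace = sum(C[k][k] for k in range(K))              # correctly classified

    numerator = trace * n - sum(r * c for r, c in zip(row_sums, col_sums))
    denominator = math.sqrt(sum(r * (n - r) for r in row_sums)) * math.sqrt(
        sum(c * (n - c) for c in col_sums)
    )
    # Assumed convention: report 0 when the denominator vanishes.
    return numerator / denominator if denominator else 0.0


# With K = 2 this reduces to the 2-class formula.
print(mcc_multiclass([[11, 2], [1, 6]]))
# A perfect 3-class classifier scores 1.
print(mcc_multiclass([[3, 0, 0], [0, 4, 0], [0, 0, 5]]))
```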


Receiver Operating Characteristic/Area Under Curve

Supposedly dates back to radar operators in the mid-20th century. HaMc83 discuss the AUC for radiology; supposedly Spac89 introduced it to machine learning, but I haven’t read the article in question. Allows you to trade off the importance of false positives against false negatives.
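
One way to get a feel for the AUC is its rank interpretation: it equals the probability that a randomly chosen positive example is scored higher than a randomly chosen negative one, with ties counted as half. A brute-force sketch of that interpretation (quadratic in the sample size, so for illustration only; the function name and the 0/1 label encoding are my assumptions):

```python
def auc_by_pairs(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, z in zip(scores, labels) if z == 1]
    neg = [s for s, z in zip(scores, labels) if z == 0]
    # Each correctly ordered pair counts 1, each tied pair counts 1/2.
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


scores = [0.1, 0.4, 0.35, 0.8]
labels = [0, 0, 1, 1]
print(auc_by_pairs(scores, labels))  # 0.75
```

Of the four positive/negative pairs, only (0.35, 0.4) is ordered wrongly, hence 3/4.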

Binary cross entropy

I’d better write down the form of this, since most ML toolkits are curiously shy about it.

Let \(x\) be the estimated probability and \(z\) be the supervised class label. Then the binary cross entropy loss is

\begin{equation*} \ell(x,z) = -z\log(x) - (1-z)\log(1-x) \end{equation*}

If what you have is not the probability \(x\) but its logit \(y=\operatorname{logit}(x)\), then the numerically stable version is

\begin{equation*} \ell(y,z) = \max\{y,0\} - yz + \log(1+\exp(-|y|)) \end{equation*}
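
A quick numerical check, as a sketch with function names of my own choosing: the logit form \(\max\{y,0\} - yz + \log(1+\exp(-|y|))\) agrees with the probability form at moderate logits and stays finite at extreme ones, where the probability form would need \(\log(0)\).

```python
import math


def bce_from_probability(x, z):
    """Binary cross entropy for probability x in (0, 1) and label z in {0, 1}."""
    return -z * math.log(x) - (1 - z) * math.log(1 - x)


def bce_from_logit(y, z):
    """Same loss computed from the logit y: max(y, 0) - y*z + log(1 + exp(-|y|))."""
    # exp is only ever applied to a non-positive number, so it cannot overflow.
    return max(y, 0.0) - y * z + math.log1p(math.exp(-abs(y)))


y, z = 0.3, 1
x = 1.0 / (1.0 + math.exp(-y))  # sigmoid, safe at this moderate logit
print(bce_from_probability(x, z), bce_from_logit(y, z))

# At an extreme logit the probability rounds to exactly 1.0 in floating point,
# so log(1 - x) is undefined, but the logit form is fine.
print(bce_from_logit(1000.0, 0))  # 1000.0
```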

f-measure et al



Flach, P., Hernández-Orallo, J., & Ferri, C. (2011) A Coherent Interpretation of AUC as a Measure of Aggregated Classification Performance. In Proceedings of the 28th International Conference on Machine Learning (ICML-11) (pp. 657–664).
Gorodkin, J. (2004) Comparing two K-category assignments by a K-category correlation coefficient. Computational Biology and Chemistry, 28(5–6), 367–374.
Hand, D. J. (2009) Measuring classifier performance: a coherent alternative to the area under the ROC curve. Machine Learning, 77(1), 103–123.
Hanley, J. A., & McNeil, B. J. (1983) A method of comparing the areas under receiver operating characteristic curves derived from the same cases. Radiology, 148(3), 839–843.
Lobo, J. M., Jiménez-Valverde, A., & Real, R. (2008) AUC: a misleading measure of the performance of predictive distribution models. Global Ecology and Biogeography, 17(2), 145–151.
Matthews, B. W. (1975) Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochimica et Biophysica Acta (BBA) - Protein Structure, 405(2), 442–451.
Powers, D. M. (2007) Evaluation: from Precision, Recall and F-measure to ROC, Informedness, Markedness and Correlation.
Reid, M. D., & Williamson, R. C. (2011) Information, Divergence and Risk for Binary Experiments. Journal of Machine Learning Research, 12(Mar), 731–817.
Spackman, K. A. (1989) Signal Detection Theory: Valuable Tools for Evaluating Inductive Learning. In Proceedings of the Sixth International Workshop on Machine Learning (pp. 160–163). San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.