# Classification

### labelling losses, fitting classifiers etc

Usefulness: 🔧 🔧
Novelty: 💡
Uncertainty: 🤪 🤪 🤪
Incompleteness: 🚧 🚧

## Multi-label

Precision/recall and F-scores all extend to multi-label classification, although they behave poorly under class imbalance.
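As a concrete sketch of how these extend, here is micro- and macro-averaged F1 over binary indicator matrices (rows are samples, columns are labels). The function names and the zero-denominator convention (define 0/0 as 0) are my own illustrative choices, not from any particular library.

```python
def f1(tp, fp, fn):
    """F1 from raw counts; 0/0 is defined as 0."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def micro_macro_f1(y_true, y_pred):
    """Micro/macro F1 for multi-label indicator matrices (lists of 0/1 rows)."""
    n_labels = len(y_true[0])
    tps, fps, fns = [0] * n_labels, [0] * n_labels, [0] * n_labels
    for t_row, p_row in zip(y_true, y_pred):
        for j, (t, p) in enumerate(zip(t_row, p_row)):
            tps[j] += t * p              # true positive for label j
            fps[j] += (1 - t) * p        # false positive for label j
            fns[j] += t * (1 - p)        # false negative for label j
    # Micro: pool counts across labels, then compute F1 once.
    micro = f1(sum(tps), sum(fps), sum(fns))
    # Macro: compute F1 per label, then average (each label weighted equally).
    macro = sum(f1(tp, fp, fn) for tp, fp, fn in zip(tps, fps, fns)) / n_labels
    return micro, macro
```

The micro/macro distinction is where imbalance bites: macro averaging gives a rare label the same weight as a common one, so a single hard rare label can dominate the score.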

🚧

## Metric Zoo

One of the less abstruse summaries of these is the scikit-learn classifier loss page, which includes both formulae and verbal descriptions; this is surprisingly hard to find in, e.g., the documentation for deep learning toolkits, in keeping with the field's general taste for magical black boxes. The Pirates guide to various scores provides an easy introduction.

### Matthews correlation coefficient

Due to Matthews (Matt75). This is the first choice for seamlessly handling multi-label problems, since its behaviour is reasonable for 2-class or multi-class, balanced or unbalanced data, and it's computationally cheap. Unless your classes have very different importance, this is a good default.

However, it is not differentiable with respect to classification certainties, so you can't use it as, e.g., a training target in neural nets; instead you optimise a differentiable surrogate loss and use this to track your progress.

#### 2-class case

Take your $$2 \times 2$$ confusion matrix of true positives, false positives etc.

${\text{MCC}}={\frac {TP\times TN-FP\times FN}{{\sqrt {(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}}}$

$|{\text{MCC}}|={\sqrt {{\frac {\chi ^{2}}{n}}}}$
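The 2-class formula is a one-liner; here is a direct transcription, with the common convention of returning 0 when the denominator vanishes (names are illustrative).

```python
import math

def mcc_binary(tp, tn, fp, fn):
    """2-class Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: if any marginal is empty the score is undefined; return 0.
    return (tp * tn - fp * fn) / denom if denom else 0.0
```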

#### Multiclass case

Take your $$K \times K$$ confusion matrix $$C$$, then

${\displaystyle {\text{MCC}}={\frac {\sum _{k}\sum _{l}\sum _{m}\left(C_{kk}C_{lm}-C_{kl}C_{mk}\right)}{{\sqrt {\sum _{k}\left(\sum _{l}C_{kl}\right)\left(\sum _{k'|k'\neq k}\sum _{l'}C_{k'l'}\right)}}{\sqrt {\sum _{k}\left(\sum _{l}C_{lk}\right)\left(\sum _{k'|k'\neq k}\sum _{l'}C_{l'k'}\right)}}}}}$
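A direct transcription of the multiclass formula (Gorodkin 2004), taking $$C$$ as a list of lists with `C[k][l]` counting class-$$k$$ items predicted as class $$l$$. The triple sum in the numerator collapses to $$n \operatorname{tr} C - \sum_k r_k c_k$$ (with $$r_k, c_k$$ the row and column sums), which is what the code exploits; purely an illustrative sketch.

```python
import math

def mcc_multiclass(C):
    """Multiclass Matthews correlation coefficient; C[k][l] = true k, predicted l."""
    K = len(C)
    n = sum(sum(row) for row in C)
    trace = sum(C[k][k] for k in range(K))
    rows = [sum(C[k]) for k in range(K)]                       # true-class totals
    cols = [sum(C[k][l] for k in range(K)) for l in range(K)]  # predicted totals
    # Numerator: n * trace(C) - sum_k rows[k] * cols[k], equal to the triple sum.
    num = n * trace - sum(r * c for r, c in zip(rows, cols))
    # Each sqrt term: sum_k rows[k] * (n - rows[k]), likewise for columns.
    den = math.sqrt(sum(r * (n - r) for r in rows)) * \
          math.sqrt(sum(c * (n - c) for c in cols))
    return num / den if den else 0.0
```

For $$K=2$$ this reduces exactly to the 2-class formula above, which makes a handy sanity check.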

### ROC/AUC

Receiver Operating Characteristic/Area Under Curve. Supposedly dates back to radar operators in the mid-century. HaMc83 discuss the AUC for radiology; supposedly Spac89 introduced it to machine learning, but I haven't read the article in question. It allows you to trade off the importance of false positives against false negatives.
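One useful fact: the AUC equals the probability that a randomly chosen positive example scores above a randomly chosen negative one (the Mann-Whitney statistic), with ties counting half. A brute-force $$O(n^2)$$ sketch, with illustrative names:

```python
def auc(scores, labels):
    """AUC as the Mann-Whitney statistic: P(positive score > negative score)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    # Count pairwise "wins" of positives over negatives; ties count 0.5.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Production implementations sort once and use ranks, giving $$O(n \log n)$$, but the pairwise form makes the probabilistic interpretation obvious.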

### Cross entropy

I'd better write down the form of this, since most ML toolkits are curiously shy about it.

Let $$x$$ be the estimated probability and $$z$$ be the supervised class label. Then the binary cross entropy loss is

$\ell(x,z) = -z\log(x) - (1-z)\log(1-x)$

If $$y=\operatorname{logit}(x)$$ is a logit rather than a probability, then the numerically stable version is

$\ell(y,z) = \max\{y,0\} - yz + \log(1+\exp(-|y|))$
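A transcription of the stable logit form, for reference; `log1p` avoids precision loss for small arguments, and no positive number is ever exponentiated, so it cannot overflow where the naive $$\sigma$$-then-$$\log$$ route would.

```python
import math

def bce_with_logits(y, z):
    """Numerically stable binary cross entropy from logit y and label z."""
    # Equivalent to -z*log(sigmoid(y)) - (1-z)*log(1 - sigmoid(y)),
    # but exp() only ever sees a non-positive argument.
    return max(y, 0.0) - y * z + math.log1p(math.exp(-abs(y)))
```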

🚧

# Refs

Flach, Peter, José Hernández-Orallo, and Cesar Ferri. 2011. "A Coherent Interpretation of AUC as a Measure of Aggregated Classification Performance." In Proceedings of the 28th International Conference on Machine Learning (ICML-11), 657–64. http://www.icml-2011.org/papers/385_icmlpaper.pdf.

Gorodkin, J. 2004. "Comparing Two K-Category Assignments by a K-Category Correlation Coefficient." Computational Biology and Chemistry 28 (5-6): 367–74. https://doi.org/10.1016/j.compbiolchem.2004.09.006.

Hand, David J. 2009. "Measuring Classifier Performance: A Coherent Alternative to the Area Under the ROC Curve." Machine Learning 77 (1): 103–23. https://doi.org/10.1007/s10994-009-5119-5.

Hanley, J A, and B J McNeil. 1983. "A Method of Comparing the Areas Under Receiver Operating Characteristic Curves Derived from the Same Cases." Radiology 148 (3): 839–43. https://doi.org/10.1148/radiology.148.3.6878708.

Jung, Alexander, Alfred O. Hero III, Alexandru Mara, and Saeed Jahromi. 2016. "Semi-Supervised Learning via Sparse Label Propagation," December. http://arxiv.org/abs/1612.01414.

Lobo, Jorge M., Alberto Jiménez-Valverde, and Raimundo Real. 2008. "AUC: A Misleading Measure of the Performance of Predictive Distribution Models." Global Ecology and Biogeography 17 (2): 145–51. https://doi.org/10.1111/j.1466-8238.2007.00358.x.

Matthews, B. W. 1975. "Comparison of the Predicted and Observed Secondary Structure of T4 Phage Lysozyme." Biochimica et Biophysica Acta (BBA) - Protein Structure 405 (2): 442–51. https://doi.org/10.1016/0005-2795(75)90109-9.

Powers, David Martin. 2007. "Evaluation: From Precision, Recall and F-Measure to ROC, Informedness, Markedness and Correlation." http://dspace.flinders.edu.au/xmlui/handle/2328/27165.

Reid, Mark D., and Robert C. Williamson. 2011. "Information, Divergence and Risk for Binary Experiments." Journal of Machine Learning Research 12 (Mar): 731–817. http://www.jmlr.org/papers/v12/reid11a.html.

Spackman, Kent A. 1989. "Signal Detection Theory: Valuable Tools for Evaluating Inductive Learning." In Proceedings of the Sixth International Workshop on Machine Learning, 160–63. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc. http://dl.acm.org/citation.cfm?id=102118.102172.