# Classification

## Multi-label

Precision/recall and F-scores all work for multi-label classification, although they behave poorly under class imbalance.

TBD.

## Metric Zoo

One of the less abstruse summaries of these is the scikit-learn classifier loss page, which includes both formulae and verbal descriptions; this is surprisingly hard to find in, e.g., the documentation for deep learning toolkits, in keeping with the field’s general taste for magical black boxes.

### Matthews correlation coefficient

Due to Matthews (Matt75). This is the first choice for seamlessly handling multiclass problems, since its behaviour is reasonable for 2-class or multiclass, balanced or unbalanced data, and it’s computationally cheap. Unless your classes have very different importance, this is a good default.

However, it is not differentiable with respect to classification certainties, so you can’t use it as, e.g., a training target in neural nets; instead you train on a differentiable surrogate loss and use this to track your progress.

#### 2-class case

Take your $$2 \times 2$$ confusion matrix of true positives, false positives, etc. Then

${\text{MCC}}={\frac {TP\times TN-FP\times FN}{{\sqrt {(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}}}$

$|{\text{MCC}}|={\sqrt {{\frac {\chi ^{2}}{n}}}}$
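The 2-class formula is a one-liner; here is a minimal sketch (the function name and the zero-denominator convention are my own):

```python
import math

def mcc_binary(tp, tn, fp, fn):
    """Matthews correlation coefficient from 2x2 confusion counts."""
    num = tp * tn - fp * fn
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # If any marginal is zero the denominator vanishes; the usual
    # convention is to return 0 in that case.
    return num / denom if denom else 0.0
```

A perfect classifier gives $$+1$$, a perfectly wrong one gives $$-1$$, and chance performance gives $$0$$.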

#### Multiclass case

Take your $$K \times K$$ confusion matrix $$C$$, then

${\displaystyle {\text{MCC}}={\frac {\sum _{k}\sum _{l}\sum _{m}C_{kk}C_{lm}-C_{kl}C_{mk}}{{\sqrt {\sum _{k}(\sum _{l}C_{kl})(\sum _{k'|k'\neq k}\sum _{l'}C_{k'l'})}}{\sqrt {\sum _{k}(\sum _{l}C_{lk})(\sum _{k'|k'\neq k}\sum _{l'}C_{l'k'})}}}}}$
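The triple sum above collapses into an equivalent form in terms of the trace and the row/column marginals, which is cheaper to compute. A sketch, assuming rows are true labels and columns are predictions (the orientation is my assumption; the function name is mine):

```python
import math
import numpy as np

def mcc_multiclass(C):
    """MCC from a K x K confusion matrix C (rows = true, cols = predicted)."""
    C = np.asarray(C, dtype=float)
    t = C.sum(axis=1)   # per-class counts of true labels
    p = C.sum(axis=0)   # per-class counts of predictions
    n = C.sum()         # total number of samples
    num = np.trace(C) * n - t @ p
    denom = math.sqrt(n**2 - p @ p) * math.sqrt(n**2 - t @ t)
    # Same zero-denominator convention as the 2-class case.
    return num / denom if denom else 0.0
```

For $$K=2$$ this reduces to the 2-class formula.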

### ROC/AUC

Receiver Operating Characteristic/Area Under the Curve. Supposedly dates back to radar operators in the mid-twentieth century. HaMc83 discuss the AUC for radiology; supposedly Spac89 introduced it to machine learning, but I haven’t read the article in question. It allows you to trade off the importance of false positives against false negatives.
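One useful fact: the AUC equals the probability that a randomly chosen positive example scores above a randomly chosen negative one (the Mann–Whitney U statistic, with ties counted as half). A brute-force sketch of that interpretation (function and argument names are mine):

```python
def auc(scores_pos, scores_neg):
    """AUC as the Mann-Whitney statistic: the fraction of
    (positive, negative) pairs where the positive scores higher,
    counting ties as 1/2."""
    wins = sum(
        1.0 if sp > sn else 0.5 if sp == sn else 0.0
        for sp in scores_pos
        for sn in scores_neg
    )
    return wins / (len(scores_pos) * len(scores_neg))
```

This is $$O(n^2)$$; real implementations sort the scores instead, but the rank-statistic view is the intuition.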

### Cross entropy

I’d better write down the form of this, since most ML toolkits are curiously shy about it.

Let $$x$$ be the estimated probability and $$z$$ be the supervised class label. Then the binary cross entropy loss is

$\ell(x,z) = -z\log(x) - (1-z)\log(1-x)$

If $$y=\operatorname{logit}(x)$$ is not a probability but a logit, then the numerically stable version is

$\ell(y,z) = \max\{y,0\} - yz + \log(1+\exp(-|y|))$
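The stable logit form above is what, e.g., PyTorch’s `BCEWithLogitsLoss` computes under the hood. A minimal scalar sketch (the function name is mine):

```python
import math

def bce_with_logits(y, z):
    """Numerically stable binary cross-entropy from a logit y and label z.

    Mathematically equal to -z*log(sigmoid(y)) - (1-z)*log(1-sigmoid(y)),
    but exp() is only ever applied to a non-positive argument, so it
    cannot overflow for large |y|.
    """
    return max(y, 0.0) - y * z + math.log1p(math.exp(-abs(y)))
```

For example, `bce_with_logits(0.0, 1.0)` is $$\log 2$$, since a zero logit means the model assigns probability $$1/2$$.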

TBD.