The Living Thing / Notebooks :

Count models (and regression thereof)

You have data and/or predictions made up of non-negative integers \({\mathbb N }\cup\{0\}\). What models can you fit to it?

I’m collecting some appropriate models for such data so that I can do regression which does not reduce to approximating them as Gaussian, or as Bernoulli, the extreme cases usually dealt with.

Also, is there a count-based formulation for non-negative regression? Non-negative matrix factorisations with appropriate loss function perhaps?


  1. raid the document topic model literature for this. Surely they are implicitly count data? See string bags compare with Steyvers and Tenenbaum’s semantic network model (StTe05).
  2. robust regression for all of these

All the distributions I discuss here have support unbounded above. Bounded distributions (e.g. vanilla Binomial) are for some other time. The exception is the Bernoulli RV, a.v. the biassed coin, which is simple enough to sneak in.

For the details of these models in a time series context, see

A lot of this material is probably in JoKK05.


The Poisson is reminiscent of the Gaussian for count data, in terms of the number of places that it pops up, and the vast number of things that have a limiting Poisson distribution.

Conveniently, it has only one parameter, which means you don’t need to justify how you chose any other parameters, saving valuable time and thought. It’s useful as a “null model”, in that the number of particles in realisation of a point process without interaction will be Poisson-distributed, conditional upon the mean measure. Conversely, non-Poisson residuals are evidence that your model has failed to take out some kind of interaction or hidden variable.

\(\text{Poisson}(\lambda)\) Pmf
\(\operatorname{P}(k;\lambda)={\frac {\lambda ^{k}}{k!}}e^{{-\lambda }}\) Mean
\(\lambda\) Variance
\(\lambda\) Pgf
\(G(s;\lambda)=\exp(\lambda (s-1))\)

Negative Binomial

Nearly a century old! (GrYu20) A generic count data model which, unlike the Poisson, has both location-like and scale-like parameters, instead of only one parameter. This makes one feel less dirty about a using a model more restrictive than standard linear regression, which sets the benchmark for being castigated for being too restrictive. Has a traditional rationale in terms of drawing balls from urns and such, which is of little interest here. the key point is that it is both flexible and uncontroversial.

It includes Geometric as special cases when \(r=1\), and the Gamma and Poisson as limiting cases. More precisely, the Poisson is a limiting case of the Polya distribution. for example, in the large-k limit, it approximates the Gamma distribution, and, when the mean is held constant, in the large \(r\) limit, it approaches Poisson. For fixed r it is an exponential family.

For all that, it’s still contrived, this model, and tedious to fit.

\(\text{NB}(p,r)\) Pmf
\(\operatorname{P}(k;p,r) = {k+r-1 \choose k}\cdot (1-p)^{r}p^{k} = \frac{\Gamma(k+r)}{k!\,\Gamma(r)} (1-p)^rp^k\) Mean
\({\frac {pr}{1-p}}\) Variance
\({\frac {pr}{(1-p)^{2}}}\) Pgf
\([G](#G){NB}(s;p,r)=\left({\frac {1-p}{1-ps}}\right)^{{\!r}}{\text{ for }}|s|<{\frac 1p}\)

Mean/dispersion parameterisation (Polya)

Commonly, where the \(r\) parameter is not required to be a non-negative integer, we call it a Polya model, and use it for over-dispersed data i.e. data that looks like a Poisson process if we are drunk, but whose variance is too big in comparison to its mean for soberer minds.

To see how that works we will reparameterise the model in terms of a “location” parameter \(\lambda\) and “dispersion”/scale-ish parameter \(\alpha\), such that we can rewrite it.

\(\text{Polya}(\lambda,\alpha)\) Pmf
\(\operatorname{P}(k;\lambda,\alpha) = \frac{\Gamma\left(k+\frac{1}{\alpha}\right)}{\Gamma(k+1)\Gamma\left(\frac{1}{\alpha}\right)} \left(\frac{\lambda}{\lambda+\frac{1}{\alpha}}\right)^k\left(1+\lambda\alpha\right)^{-\frac{1}{\alpha}}\) Mean
\(\lambda\) Variance
\(\lambda + \lambda^2\alpha\) Pgf
\(G(s;\lambda,\alpha)=(\alpha (\lambda -\lambda s)+1)^{-1/\alpha }\)

The log Pmf is then

\[ \log\Gamma\left(k+\frac{1}{\alpha}\right)-\log\Gamma(k+1)-\log\Gamma\left(\frac{1}{\alpha}\right)+k\left(\log\lambda-\log\left(\lambda+\frac{1}{\alpha}\right)\right)-\frac{\log(1+\lambda\alpha)}{\alpha} \]

It will be apparent that all these log-gamma differences will be numerically unstable, so we need to use different approximations for it depending in the combination of \(k,\lambda\) or \(\alpha\) values in play.

Aieee! Tedious!

I can’t understand why anyone bothers using the negative binomial as a model unless they have strong reason to think it is a true model; the GPD is easier to fit, in that the log Pmf is numerically tractable for all parameter combinations even with large count values, and it’s no less natural; IMO it usually has has more, IMO, plausible justifications than the Polya/NB model.


A discrete analogue of the exponential, i.e. The probability distribution of the number X of Bernoulli trials before the first success, supported on the set \(\{0, 1, 2, 3, …\}\)

\(\text{Geom}(p)\) Pmf
\(\operatorname{P}(k;p) = (1-p)^k\,p\!\) Pgf
\(G(s;p)=\frac{p}{1-s+sp}.\) Mean
\(\frac {1-p}{p}\) Variance
\(\frac {1-p}{p^{2}}\)

Note that \({\text{Geom}}(p)={\text{NB}}(1,\,1-p).\,\).

Mean parameterisation

We can parameterise this in terms of the mean \(\lambda=\frac {1-p}{p}\Rightarrow p=\frac{1}{\lambda+1}\)

\(\text{MGeom}(\lambda)\) Pmf
\(\operatorname{P}(k;\lambda) = \left(\frac{\lambda}{\lambda+1}\right)^k\frac{1}{\lambda+1}\) Pgf
\(G(s;\lambda)=\frac{\lambda+1}{s\lambda+1}.\) Mean
\(\lambda\) Variance

Lagrangian distributions

Another clade of distribution where we work backwards from the pgf, although we generate the distribution from a function this time, or rather, two functions. This family includes various others on this page; I will check which some day. For now, let’s get to the interesting new ones.

For it is more like a jungle of distributions, requiring a map to hack through it. There are various parameters estimation methods and properties, all applicable to different sub-families. Sometimes the forms of the mass function are explicit and easy. Others, not so much.

See CoFa06 for the authoritative list, plus JoKK05 Ch7.2 for a brusquer version.

It’s interesting to me because (CoFa06 Ch 6.2, CoSh88) the total cascade size of a subcritical branching process has a “delta Lagrangian” or “general Lagrangian” distribution, depending on whether the cluster has a deterministic or random starting population. We will define offspring distribution of such a branching process as \(G\sim G_Y(\eta, \alpha)\(, with\)EG:=1).

Let’s get specific.

Poisson-Poisson Lagrangian

See Consul and Famoye (CoFa06, 9.3). Also known as the Generalised Poisson, although there are many things called that.

there are many possible interpretations for this; I will choose the interpretation in terms of cascade sizes of branching processes, in which case we have


\(GPD(\mu,\eta)\) Pmf
\(\operatorname{P}(X=x;\mu,\eta)=\frac{\mu(\mu+ \eta x)^{x-1}}{x!e^{\mu+x\eta}}\) Mean
\(\frac{\mu}{1-\eta}\) Variance

Notice that this can produce long tails, in the sense that it can have a large variance with finite mean, but not heavy tails, in the sense of the variance becoming infinite while retaining a finite mean. (Q: What non-negative distribution has the quality of parameterised explosion of moments, apart from the inconvenient discrete-stable?)

Here, I implemented the Poisson-Poisson GPD in python for you.

Basic Lagrangian distribution

Summarised in CoSh72.

One parameter: a differentiable (infinitely?) function, not necessarily a pgf, \(g: [0,1]\rightarrow \mathbb{R} \text{ s.t. } g(0)\neq 0\text{ and } g(1)=1\). Now we define a pgf \(\psi(s)\) implicitly by the smallest root of the Lagrange transformation \(z=sg(z)\). The paradigmatic example of such a function is \(g:z\mapsto 1−p+pz\); let’s check this fella out.

? Pmf
? Mean
? Variance


Delta Lagrangian distributions


General Lagrangian distributions


Discrete Stable

Another generalisation of Poisson, with until-recently hip features such as a power-law tail.

By analogy with the continuous-stable distribution, a “stable” family for count data.

In particular, this is stable in the sense that it is a limiting distribution for sums of count random variables, analogous to the continuous stable family for real-valued RVs.

No (convenient) closed form for the Pmf in the general case, but the Pgf is simple, so that’s something.

\(\text{DS}(\nu,a)\) Pmf
\(\operatorname{P}(k;\nu,a)= \left.\frac{1}{k!} \frac{d^kG(s;\nu,a)}{ds^k}\right|[](#){s=0}.\) (which is simply the usual formula for extracting the Pmf from any Pgf.) Pgf
\(G(s;\nu,a)=\exp(-a (1-s)^\nu).\) Mean
\(\operatorname{E}(X) = G'(1^-) = a \nu e^{-a}.\) Variance
\(\operatorname{Var}(X)=G''(1^-) + \operatorname{E}(X) - \operatorname{E}^2(X).\)

Here, \(a>0\) is a scale parameter and \(0\lt\nu\leq 1\) a dispersion parameter describing in particular a power-law tail such that when \(\nu\lt 1\),

\[ \lim_{k \to \infty}\operatorname{P}(k;\nu,a) \simeq \frac{1}{k^{\nu+1}}. \]

Question: the Pgf formulation implies this is a non-negative distribution. Does that mean that symmetric discrete RVs cannot be stable? Possibly-negative ones?

Nola99 and Nola01 give some approximate ML-estimators of the parameter. Lee10 does some interesting stuff:

This thesis considers the interplay between the continuous and discrete
properties of random stochastic processes.
It is shown that the special cases of the one-sided Lévy-stable
can be connected to the class of discrete-stable distributions through a
doubly-stochastic Poisson transform.
This facilitates the creation of a one-sided stable process for which the
N-fold statistics can be factorised explicitly. […]
Using the same Poisson transform interrelationship, an exact method for
generating discrete-stable variates is found.
It has already been shown that discrete-stable distributions occur in the
crossing statistics of continuous processes whose autocorrelation exhibits
fractal properties.
The statistical properties of a nonlinear filter analogue of a phase-screen
model are calculated, and the level crossings of the intensity analysed.
The asymptotic properties of the inter-event density of the process are
found to be accurately approximated by a function of the Fano factor
and the mean of the crossings alone.

Zipf/Zeta models

The discrete version of the basic power-law models.

While we are here, the plainest explanation of the relation of Zips to Pareto distribution that I know is Lada Adamic’s Zipf, Power-laws, and Pareto - a ranking tutorial.

\(\text{Zipf}(s)\) Pmf
\(\operatorname{P}(k;s)={\frac {1/k^{s}}{\zeta (s)}}\) Mean
\({\frac {\zeta (s-1)}{\zeta (s)}}~{\textrm {for}}~s>2\) Variance
\({\frac {\zeta (s)\zeta (s-2)-\zeta (s-1)^{2}}{\zeta (s)^{2}}}~{\textrm {for}}~s>3\)

This has unbounded support. In the bounded case, it becomes the Zipf–Mandelbrot law, which is too fiddly for me to discuss here unless I turn out to really need it, which would likely be for ranking statistics.


\(\text{YS}(\rho)\) Pmf
\(\operatorname{P}(k;\rho)=\rho \,{\mathrm {B}}(k,\rho +1),\,\) Mean
\({\frac {\rho }{\rho -1}},\, \rho \gt 1\) Variance
\({\frac {\rho^2}{(\rho-1)^2\;(\rho -2)}}\,,\, \rho \gt 2\)

where B is the beta function.

Zipf law in the tail. See also the two-parameter version, which replaces the beta function with an incomplete beta function, giving Pmf \(\operatorname{P}(k;\rho,\alpha )={\frac {\rho }{1-\alpha ^{{\rho }}}}\;{\mathrm {B}}_{{1-\alpha }}(k,\rho +1),\,\)

I’m bored with this one too.


Exponential family count model with free variance parameter. See CHSH16.

Decomposability properties

For background, see decomposability.


By analogy with the continuous case we may construct a stability equation for count RVs:

\[ X(a) \triangleq W^{1/\alpha}\odot X(b), \]

\(\odot\) here is Steutel and van Harn’s discrete multiplication operator, which I won’t define here exhaustively because there are variously complex formulations of it, and I don’t care enough to wrangle them. In the simplest case it gives us a binomial thinning of the left operand by the right

\[ A\odot B \sim \mathrm{Binom}(A,B) \]


Poisson RVs are self-divisible, in the sense that

\[ X_1\sim \text{Poisson}(\lambda_1),\,X_2\sim \text{Poisson}(\lambda_2),\,X_1\perp X_2 \Rightarrow X_1+X_2\sim \text{Poisson}(\lambda_1+\lambda_2). \]

Polya RVs likewise are self-divisible, if uglier under this parameterisation

\[ X_1\sim \text{Polya}(\lambda_1, \alpha_1),\,X_2\sim \text{Polya}(\lambda_1, \alpha_1),\,X_1\perp X_2 \Rightarrow X_1+X_2\sim \text{Polya}\left(\lambda_1+\lambda_2,\, \frac{\alpha_1\lambda_1^2+\alpha_2\lambda_2^2}{(\lambda_1+\lambda_2)^2}\right) \operatorname{Var}(X_1+X_2)=\operatorname{Var}(X_1)+\operatorname{Var}(X_2) \]

So are GPDs, in \(\lambda\).