

entropies and other measures of surprise

Not: what you hope to get from the newspaper. Rather: Different types of (formally defined) entropy/information and their disambiguation. The seductive power of the logarithm and convex functions rather like it.

A proven path to publication is to find or reinvent a derived measure based on Shannon information, and apply it to something provocative-sounding. (Qualia! Stock markets! Evolution! Language! The qualia of evolving stock market languages!)

This page is purely about the analytic definitions of these quantities for given random variables. If you wish to estimate such a quantity empirically, from your experiment, see estimating probability metrics or, relatedly although not the same thing, information criteria.

Connected also to functional equations and, yes, statistical mechanics and quantum information physics, although never quite so closely as one would like, so don’t go conflating the disciplines.

Shannon Information

Vanilla information, thanks be to Claude Shannon. You are given a discrete random process with specified parameters. How far can you compress it down to a more parsimonious process? (Leaving coding theory aside for the moment.)

Given a random variable \(X\) taking values \(x \in \mathcal{X}\) from some discrete alphabet \(\mathcal{X}\), with probability mass function \(p(x)\), the entropy is

\begin{equation*} \begin{array}{ccc} H(X) & := & -\sum_{x \in \mathcal{X}} p(x) \log p(x) \\ & \equiv & E( \log 1/p(x) ) \end{array} \end{equation*}
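
To make that concrete, here is a minimal sketch (plain NumPy, nothing clever) that evaluates the sum for a known pmf, with base-2 logarithms so the answer comes out in bits:

```python
import numpy as np

def shannon_entropy(p, base=2):
    """Entropy of a pmf given as an array of probabilities."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                        # 0 log 0 = 0 by convention
    return -(p * np.log(p)).sum() / np.log(base)

print(shannon_entropy([0.5, 0.5]))      # a fair coin: 1.0 bit
print(shannon_entropy([0.9, 0.1]))      # a biased coin: ~0.47 bits
```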

Over at the Functional equations page I note that Tom Leinster has a clever proof of the optimality of Shannon information via functional equations.

One interesting aspect of the proof is where the difficulty lies. Let \(I:\Delta_n \to \mathbb{R}^+\) be continuous functions satisfying the chain rule; we have to show that \(I\) is proportional to \(H\). All the effort and ingenuity goes into showing that \(I\) is proportional to \(H\) when restricted to the uniform distributions. In other words, the hard part is to show that there exists a constant \(c\) such that

\begin{equation*} I(1/n, \ldots, 1/n) = c H(1/n, \ldots, 1/n) \end{equation*}

for all \(n \geq 1\).

K-L divergence

Because “Kullback-Leibler divergence” is a lot of syllables for something you use so often, even if usually in sentences like “unlike the K-L divergences”. Or you could call it the “relative entropy”, but that sounds like something to do with my uncle after the seventh round of Christmas drinks.

It is defined between the probability mass functions of two discrete random variables, \(P, Q\), where those probability mass functions are given by \(p(x)\) and \(q(x)\) respectively.

\begin{equation*} \begin{array}{ccc} D(P \parallel Q) & := & \sum_{x \in \mathcal{X}} p(x) \log \frac{p(x)}{q(x)} \\ & \equiv & E \log \frac{p(x)}{q(x)} \end{array} \end{equation*}
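
A similarly minimal sketch for two known pmfs over the same alphabet; note that the divergence is asymmetric, and infinite wherever \(q(x) = 0 < p(x)\):

```python
import numpy as np

def kl_divergence(p, q, base=2):
    """D(P || Q) for two pmfs over the same (ordered) alphabet."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0                        # 0 log(0/q) = 0 by convention
    if np.any(q[mask] == 0):
        return np.inf                   # P not absolutely continuous w.r.t. Q
    return (p[mask] * np.log(p[mask] / q[mask])).sum() / np.log(base)

print(kl_divergence([0.5, 0.5], [0.9, 0.1]))   # ~0.74 bits
print(kl_divergence([0.9, 0.1], [0.5, 0.5]))   # ~0.53 bits (not symmetric)
```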

Mutual information

The “informativeness” of one variable given another… Most simply, the K-L divergence of the joint distribution of two random variables from the product of their marginals. (That is, it vanishes if and only if the two variables are independent.)

Now, take \(X\) and \(Y\) with joint probability mass function \(p_{XY}(x,y)\) and, for clarity, marginal distributions \(p_X\) and \(p_Y\).

Then the mutual information \(I\) is given by

\begin{equation*} I(X; Y) = H(X) - H(X|Y) \end{equation*}

Estimating this one has been giving me grief lately, so I’ll be happy when I get to this section and solve it forever. See nonparametric mutual information.

Getting an intuition for what this measure does is handy, so I’ll expound some equivalent definitions that emphasise different characteristics:

\begin{equation*} \begin{array}{ccc} I(X; Y) & := & \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p_{XY}(x, y) \log \frac{p_{XY}(x,y)}{p_X(x)p_Y(y)} \\ & = & D( p_{XY} \parallel p_X p_Y) \\ & = & E \log \frac{p_{XY}(x,y)}{p_X(x)p_Y(y)} \end{array} \end{equation*}
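
These identities are easy to verify numerically for a toy joint pmf; a minimal sketch (the particular numbers are arbitrary):

```python
import numpy as np

# Toy joint pmf p_XY: rows index x, columns index y.
p_xy = np.array([[0.30, 0.10],
                 [0.15, 0.45]])
p_x = p_xy.sum(axis=1)                  # marginal of X
p_y = p_xy.sum(axis=0)                  # marginal of Y

def H(p):
    """Shannon entropy (bits) of a pmf given as an array."""
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

# Definition: expected log-ratio of the joint to the product of the marginals.
mask = p_xy > 0
mi = (p_xy[mask] * np.log2(p_xy[mask] / np.outer(p_x, p_y)[mask])).sum()

# Identity check: I(X;Y) = H(X) + H(Y) - H(X,Y) = H(X) - H(X|Y).
mi_alt = H(p_x) + H(p_y) - H(p_xy.ravel())
print(mi, mi_alt)                       # both ~0.18 bits
```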

Kolmogorov-Sinai entropy

Schreiber says:

If \(I\) is obtained by coarse graining a continuous system \(X\) at resolution \(\epsilon\), the entropy \(H_X(\epsilon)\) and entropy rate \(h_X(\epsilon)\) will depend on the partitioning and in general diverge like \(\log(\epsilon)\) when \(\epsilon \to 0\). However, for the special case of a deterministic dynamical system, \(\lim_{\epsilon\to 0} h_X(\epsilon) = h_{KS}\) may exist and is then called the Kolmogorov-Sinai entropy. (For non-Markov systems, also the limit \(k \to \infty\) needs to be taken.)

That is, it is a special case of the entropy rate for a dynamical system. Cue connection to algorithmic complexity. Also metric entropy?
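
To see the \(\log(\epsilon)\) divergence in the simplest possible case: coarse-grain a uniform density on \([0,1]\) into bins of width \(\epsilon\), so each bin has probability \(\epsilon\) and the discrete entropy is exactly \(-\log_2 \epsilon\), growing without bound as \(\epsilon \to 0\). A toy sketch:

```python
import numpy as np

# Coarse-grain Uniform[0, 1] at resolution eps: each of the 1/eps bins
# has probability eps, so H_X(eps) = -log2(eps) -> infinity as eps -> 0.
for eps in [0.1, 0.01, 0.001]:
    p = np.full(int(round(1 / eps)), eps)   # pmf of the coarse-grained variable
    print(eps, -(p * np.log2(p)).sum(), -np.log2(eps))
```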

Alternative formulations and relatives

Rényi Information

Also, the Hartley measure.

You don’t need to use a logarithm in your information summation. Free energy, something something. (?)

Many of the attractive features of information measures are due simply to the concavity of the logarithm in the summand. So, why not whack another concave function with even more handy features in there? Bam, you are now working on Rényi information. How do you feel?
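
Concretely, the standard parameterisation is by an order \(\alpha \geq 0\),

\begin{equation*} H_\alpha(X) := \frac{1}{1-\alpha} \log \left( \sum_{x \in \mathcal{X}} p(x)^\alpha \right) \end{equation*}

which recovers the Shannon entropy in the limit \(\alpha \to 1\), the Hartley measure at \(\alpha = 0\), and the min-entropy \(-\log \max_x p(x)\) as \(\alpha \to \infty\).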

Tsallis statistics

Attempting to make information measures “non-extensive”. “q-entropy”. Seems to have made a big splash in Brazil, but less in other countries. Non-extensive measures are an intriguing idea, though. I wonder if it’s parochialism that keeps everyone off Tsallis statistics, or a lack of demonstrated usefulness?
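
For concreteness, the q-entropy of a pmf \(p\) is

\begin{equation*} S_q(X) := \frac{1}{q-1}\left(1 - \sum_{x \in \mathcal{X}} p(x)^q\right) \end{equation*}

which recovers the (natural-log) Shannon entropy as \(q \to 1\). The “non-extensive” part: for independent \(X\) and \(Y\), \(S_q(X,Y) = S_q(X) + S_q(Y) + (1-q)S_q(X)S_q(Y)\), rather than plain additivity.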

Fisher information

See maximum likelihood and information criteria.

Estimating information

Wait, you don’t know the exact parameters of your generating process a priori? Then you need to estimate your information measure from data.
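
The simplest thing to do is the plug-in estimate: use empirical frequencies as the pmf and evaluate \(H\) at them. A minimal sketch below; beware that the plug-in estimator is biased downwards for small samples, which is what much of the bias-correction literature (e.g. Grassberger 1988; Paninski 2003) is about.

```python
import numpy as np
from collections import Counter

def plugin_entropy(samples, base=2):
    """Plug-in (maximum likelihood) entropy estimate from discrete samples."""
    counts = np.array(list(Counter(samples).values()), dtype=float)
    p_hat = counts / counts.sum()
    return -(p_hat * np.log(p_hat)).sum() / np.log(base)

rng = np.random.default_rng(0)
x = rng.integers(0, 8, size=50)   # 8 equiprobable symbols, true entropy = 3 bits
print(plugin_entropy(x))          # typically a little under 3 bits
```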

References

Adami, C. (2004) Information theory in molecular biology. Physics of Life Reviews, 1(1), 3–22. DOI.
Ay, N., Bertschinger, N., Der, R., Güttler, F., & Olbrich, E. (2008) Predictive information and explorative behavior of autonomous robots. The European Physical Journal B - Condensed Matter and Complex Systems, 63(3), 329–339. DOI.
Ay, N., & Polani, D. (2008) Information flows in causal networks. Advances in Complex Systems (ACS), 11(01), 17–41. DOI.
Barrett, A. B., Barnett, L., & Seth, A. K.(2010) Multivariate Granger causality and generalized variance. Phys. Rev. E, 81(4), 041907. DOI.
Blanc, J.-L., Pezard, L., & Lesne, A. (2011) Delay independence of mutual-information rate of two symbolic sequences. Phys. Rev. E, 84(3), 036214. DOI.
Ceguerra, R. V., Lizier, J. T., & Zomaya, A. Y.(2011) Information storage and transfer in the synchronization process in locally-connected networks. Presented at the IEEE Symposium Series in Computational Intelligence (SSCI 2011) - IEEE Symposium on Artificial Life.
Chaitin, G. J.(1977) Algorithmic information theory. IBM Journal of Research and Development.
Chechik, G., & Tishby, N. (n.d.) Extracting relevant structures with side information.
Chiribella, G., D’Ariano, G. M., & Perinotti, P. (2011) Informational derivation of Quantum Theory. Physical Review A, 84(1), 012311. DOI.
Chow, C. K., & Liu, C. N.(1968) Approximating discrete probability distributions with dependence trees. Information Theory, IEEE Transactions On, 14, 462–467. DOI.
Cohen, J. E.(1962) Information theory and music. Behavioral Science, 7(2), 137–163. DOI.
Cover, T. M., Gács, P., & Gray, R. M.(1989) Kolmogorov’s Contributions to Information Theory and Algorithmic Complexity. The Annals of Probability, 17(3), 840–865. DOI.
Cover, T. M., & Thomas, J. A.(2006) Elements of Information Theory. Wiley-Interscience
Csiszár, I., & Shields, P. C.(2004) Information theory and statistics: a tutorial. Foundations and Trends™ in Communications and Information Theory, 1(4), 417–528. DOI.
Dahlhaus, R. (1996) On the Kullback-Leibler information divergence of locally stationary processes. Stochastic Processes and Their Applications, 62(1), 139–168. DOI.
Darbellay, G. A., & Vajda, I. (1999) Estimation of the information by an adaptive partitioning of the observation space. IEEE Transactions on Information Theory, 45, 1315–1321. DOI.
Dewar, R. C.(2003) Information theory explanation of the fluctuation theorem, maximum entropy production and self-organized criticality in non-equilibrium stationary states. Journal of Physics A: Mathematical and General, 36, 631–641. DOI.
Eichler, M. (2001) Granger-causality graphs for multivariate time series. Granger-Causality Graphs for Multivariate Time Series.
Elidan, G., & Friedman, N. (2005) Learning Hidden Variable Networks: The Information Bottleneck Approach. J. Mach. Learn. Res., 6, 81–127.
Ellerman, D. (2017, May 22) Logical Information Theory: New Foundations for Information Theory.
Ephraim, Y., & Merhav, N. (2002) Hidden Markov processes. Information Theory, IEEE Transactions On, 48(6), 1518–1569. DOI.
Erb, I., & Ay, N. (2004) Multi-information in the thermodynamic limit. Journal of Statistical Physics, 115(3), 949–976. DOI.
Feldman, D. P.(1997) A brief introduction to: Information theory, excess entropy and computational mechanics. Department of Physics, University of California, July.
Feldman, D. P., & Crutchfield, J. P.(2004) Synchronizing to Periodicity: the Transient Information and Synchronization Time of Periodic Sequences. Advances in Complex Systems, 7(03), 329–355. DOI.
Feldman, D. P., McTague, C. S., & Crutchfield, J. P.(2008) The organization of intrinsic computation: Complexity-entropy diagrams and the diversity of natural information processing. Chaos: An Interdisciplinary Journal of Nonlinear Science, 18, 043106. DOI.
Friedman, N., Mosenzon, O., Slonim, N., & Tishby, N. (2001) Multivariate information bottleneck. In Proceedings of the Seventeenth conference on Uncertainty in artificial intelligence (pp. 152–161). San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.
Gács, P., Tromp, J., & Vitányi, P. M. B.(2001) Algorithmic statistics. IEEE Transactions on Information Theory, 47(6), 2443–2463. DOI.
Gao, S., Ver Steeg, G., & Galstyan, A. (2015) Efficient Estimation of Mutual Information for Strongly Dependent Variables. In Journal of Machine Learning Research (pp. 277–286).
Goyal, P. (2012) Information Physics—Towards a New Conception of Physical Reality. Information, 3(4), 567–594. DOI.
Granger, C. W. J.(1963) Economic processes involving feedback. Information and Control, 6(1), 28–48. DOI.
Grassberger, P. (1988) Finite sample corrections to entropy and dimension estimates. Physics Letters A, 128(6–7), 369–373. DOI.
Gray, R. M.(1991) Entropy and Information Theory. New York: Springer-Verlag
Hartley, R. V. L.(1928) Transmission of Information.
Haussler, D., & Opper, M. (1997) Mutual information, metric entropy and cumulative relative entropy risk. The Annals of Statistics, 25(6), 2451–2492. DOI.
Hlaváčková-Schindler, K., Paluš, M., Vejmelka, M., & Bhattacharya, J. (2007) Causality detection based on information-theoretic approaches in time series analysis. Physics Reports, 441(1), 1–46. DOI.
Hutter, M. (2001) Distribution of Mutual Information. ArXiv:Cs/0112019.
Jaynes, E. T.(1963) Information Theory and Statistical Mechanics. In Statistical Physics (Vol. 3).
Jiao, J., Venkat, K., Han, Y., & Weissman, T. (2014) Maximum Likelihood Estimation of Functionals of Discrete Distributions. ArXiv:1406.6959 [Cs, Math, Stat].
Jiao, J., Venkat, K., Han, Y., & Weissman, T. (2015) Minimax Estimation of Functionals of Discrete Distributions. IEEE Transactions on Information Theory, 61(5), 2835–2885. DOI.
Kandasamy, K., Krishnamurthy, A., Poczos, B., Wasserman, L., & Robins, J. M.(2014) Influence Functions for Machine Learning: Nonparametric Estimators for Entropies, Divergences and Mutual Informations. ArXiv:1411.4342 [Stat].
Keane, M. S., & O’Brien, G. L.(1994) A Bernoulli Factory. ACM Trans. Model. Comput. Simul., 4(2), 213–219. DOI.
Kelly Jr, J. L.(1956) A new interpretation of information rate. Bell System Technical Journal, 35(3), 917–926.
Klir, G. J.(2006) Uncertainty and information. Wiley Online Library
Kontoyiannis, I., & Madiman, M. (2012) Sumset inequalities for differential entropy and mutual information. In Information Theory Proceedings (ISIT), 2012 IEEE International Symposium on (pp. 1261–1265). DOI.
Kraskov, A., Stögbauer, H., & Grassberger, P. (2004) Estimating mutual information. Physical Review E, 69, 066138. DOI.
Kuich, W. (1970) On the entropy of context-free languages. Information and Control, 16(2), 173–200. DOI.
Kullback, S., & Leibler, R. A.(1951) On Information and Sufficiency. The Annals of Mathematical Statistics, 22(1), 79–86.
Leonenko, N. (2008) A class of Rényi information estimators for multidimensional densities. The Annals of Statistics, 36(5), 2153–2182. DOI.
Leskovec, J. (2012) Information Diffusion and External Influence in Networks. Eprint ArXiv:1206.1331.
Liese, F., & Vajda, I. (2006) On Divergences and Informations in Statistics and Information Theory. IEEE Transactions on Information Theory, 52(10), 4394–4412. DOI.
Lin, J. (1991) Divergence measures based on the Shannon entropy. Information Theory, IEEE Transactions On, 37(1), 145–151. DOI.
Lizier, J. T., Pritam, S., & Prokopenko, M. (2011) Information Dynamics in Small-World Boolean Networks. Artificial Life. DOI.
Lizier, J. T., & Prokopenko, M. (2010) Differentiating information transfer and causal effect. The European Physical Journal B - Condensed Matter and Complex Systems, 73(4), 605–615. DOI.
Lizier, J. T., Prokopenko, M., & Zomaya, A. Y.(2008a) A framework for the local information dynamics of distributed computation in complex systems.
Lizier, J. T., Prokopenko, M., & Zomaya, A. Y.(2008b) Local information transfer as a spatiotemporal filter for complex systems. Physical Review E, 77, 026110. DOI.
Madiman, M. (2008) On the entropy of sums. In Information Theory Workshop, 2008. ITW ’08. IEEE (pp. 303–307). DOI.
Marton, K. (2015) Logarithmic Sobolev inequalities in discrete product spaces: a proof by a transportation cost distance. ArXiv:1507.02803 [Math].
Maynard Smith, J. (2000) The Concept of Information in Biology. Philosophy of Science, 67(2), 177–194.
McGill, W. J.(1954) Multivariate information transmission. Information Theory, IRE Professional Group On, 4(4), 93–111. DOI.
Miller, D. J., & Liu, W. (2002) On the recovery of joint distributions from limited information. Journal of Econometrics, 107(1–2), 259–274. DOI.
Moon, K. R., & Hero III, A. O.(2014) Multivariate f-Divergence Estimation With Confidence. In NIPS 2014.
Nemenman, I., Shafee, F., & Bialek, W. (2001) Entropy and inference, revisited. In arXiv:physics/0108025.
Nykter, M., Price, N. D., Larjo, A., Aho, T., Kauffman, S. A., Yli-Harja, O., & Shmulevich, I. (2008) Critical Networks Exhibit Maximal Information Diversity in Structure-Dynamics Relationships.
Palomar, D. P., & Verdu, S. (2008) Lautum Information. IEEE Transactions on Information Theory, 54(3), 964–975. DOI.
Paninski, L. (2003) Estimation of entropy and mutual information. Neural Computation, 15(6), 1191–1253. DOI.
Panzeri, S., Senatore, R., Montemurro, M. A., & Petersen, R. S.(2007) Correcting for the sampling bias problem in spike train information measures. Journal of Neurophysiology, 98, 1064–1072. DOI.
Panzeri, S., & Treves, A. (1996) Analytical estimates of limited sampling biases in different information measures. Network: Computation in Neural Systems, 7(1), 87–107.
Pereira, F. (2000) Formal grammar and information theory: together again? Philosophical Transactions of the Royal Society of London. Series A: Mathematical, Physical and Engineering Sciences, 358(1769), 1239–1253. DOI.
Phat, V. N., Thanh, N. T., & Trinh, H. (2014) Full-Order observer design for nonlinear complex large-scale systems with unknown time-varying delayed interactions. Complexity, n/a-n/a. DOI.
Pinkerton, R. C.(1956) Information theory and melody. Scientific American, 194(2), 77–86. DOI.
Plotkin, J. B., & Nowak, M. A.(2000) Language Evolution and Information Theory. Journal of Theoretical Biology, 205, 147–159. DOI.
Prokopenko, M., Lizier, J. T., Obst, O., & Wang, X. R.(2011) Relating Fisher information to order parameters. Phys. Rev. E, 84(4), 041116. DOI.
Raginsky, M. (2011) Directed information and Pearl’s causal calculus. In 2011 49th Annual Allerton Conference on Communication, Control, and Computing (Allerton) (pp. 958–965). DOI.
Raginsky, M., & Sason, I. (2012) Concentration of Measure Inequalities in Information Theory, Communications and Coding. Foundations and Trends in Communications and Information Theory.
Rissanen, J. (2007) Information and complexity in statistical modeling. New York: Springer
Roulston, M. S.(1999) Estimating the errors on measured entropy and mutual information. Physica D: Nonlinear Phenomena, 125(3–4), 285–294. DOI.
Ryabko, D., & Ryabko, B. (2010) Nonparametric Statistical Inference for Ergodic Processes. IEEE Transactions on Information Theory, 56(3), 1430–1435. DOI.
Schreiber, T. (2000) Measuring information transfer. Physical Review Letters, 85(2), 461–464.
Schuch, N., Harrison, S. K., Osborne, T. J., & Eisert, J. (2011) Information propagation for interacting-particle systems. Phys. Rev. A, 84(3), 032309. DOI.
Schürmann, T. (2015) A Note on Entropy Estimation. Neural Computation, 27(10), 2097–2106. DOI.
Shalizi, C. R., & Crutchfield, J. P.(2002) Information bottlenecks, causal states, and statistical relevance bases: how to represent relevant information in memoryless transduction. Advances in Complex Systems, 05(01), 91–95. DOI.
Shelah, S. (2000) Choiceless Polynomial Time Logic: Inability to Express. In P. G. Clote & H. Schwichtenberg (Eds.), Computer Science Logic (pp. 72–125). Springer Berlin Heidelberg
Shibata, R. (1997) Bootstrap estimate of Kullback-Leibler information for model selection. Statistica Sinica, 7, 375–394.
Shields, P. C.(1998) The interactions between ergodic theory and information theory. Information Theory, IEEE Transactions On, 44(6), 2079–2093. DOI.
Shlens, J., Kennel, M. B., Abarbanel, H. D. I., & Chichilnisky, E. J.(2007) Estimating Information Rates with Confidence Intervals in Neural Spike Trains. Neural Computation. DOI.
Slonim, N., Atwal, G. S., Tkačik, G., & Bialek, W. (2005) Information-based clustering. Proceedings of the National Academy of Sciences of the United States of America, 102, 18297–18302. DOI.
Slonim, N., Friedman, N., & Tishby, N. (2006) Multivariate information bottleneck. Neural Computation, 18(8), 1739–1789. DOI.
Slonim, N., & Tishby, N. (2000) Agglomerative information bottleneck. Advances in Neural Information Processing Systems, 12, 617–623.
Steeg, G. V., & Galstyan, A. (2015) The Information Sieve. ArXiv:1507.02284 [Cs, Math, Stat].
Strong, S. P., Koberle, R., de Ruyter van Steveninck, R. R., & Bialek, W. (1998) Entropy and Information in Neural Spike Trains. Phys. Rev. Lett., 80(1), 197–200. DOI.
Strouse, D. J., & Schwab, D. J.(2016) The deterministic information bottleneck. ArXiv:1604.00268 [Cond-Mat, q-Bio].
Studený, M. (2016) Basic facts concerning supermodular functions. ArXiv:1612.06599 [Math, Stat].
Studený, M., & Vejnarová, J. (1998) On multiinformation function as a tool for measuring stochastic dependence. In Learning in graphical models (pp. 261–297). Cambridge, Mass.: MIT Press
Taylor, S. F., Tishby, N., & Bialek, W. (2007) Information and fitness. Arxiv Preprint ArXiv:0712.4382.
Tishby, N., Pereira, F. C., & Bialek, W. (2000) The information bottleneck method. ArXiv:Physics/0004057.
Tishby, N., & Polani, D. (2011) Information theory of decisions and actions. In PERCEPTION-ACTION CYCLE (pp. 601–636). Springer
Toda, A. A.(2011) An Information-Theoretic Approach to Nonparametric Estimation, Model Selection, and Goodness of Fit. ArXiv:1103.4890 [Math, Stat].
Torkkola, K. (2003) Feature extraction by non parametric mutual information maximization. J. Mach. Learn. Res., 3, 1415–1438.
Vereshchagin, N. K., & Vitanyi, P. M. B.(2004) Kolmogorov’s structure functions and model selection. IEEE Transactions on Information Theory, 50(12), 3265–3290. DOI.
Vereshchagin, N. K., & Vitanyi, P. M. B.(2010) Rate Distortion and Denoising of Individual Data Using Kolmogorov Complexity. IEEE Transactions on Information Theory, 56(7), 3438–3454. DOI.
Victor, J. D.(2002) Binless strategies for estimation of information from neural data. Physical Review E, 66, 051903. DOI.
Weidmann, C., & Vetterli, M. (2012) Rate Distortion Behavior of Sparse Sources. IEEE Transactions on Information Theory, 58(8), 4969–4992. DOI.
Weissman, T., Kim, Y. H., & Permuter, H. H.(2013) Directed Information, Causal Estimation, and Communication in Continuous Time. IEEE Transactions on Information Theory, 59(3), 1271–1287. DOI.
Wolf, D. R., & Wolpert, D. H.(1994a) Estimating Functions of Distributions from A Finite Set of Samples, Part 2: Bayes Estimators for Mutual Information, Chi-Squared, Covariance and other Statistics. ArXiv:Comp-Gas/9403002.
Wolpert, D. H.(2006a) Information Theory—The Bridge Connecting Bounded Rational Game Theory and Statistical Physics. In Complex Engineered Systems (pp. 262–290). Springer Berlin Heidelberg
Wolpert, D. H.(2006b) What Information Theory says about Bounded Rational Best Response. In The Complex Networks of Economic Interactions (pp. 293–306). Springer
Wolpert, D. H., & Wolf, D. R.(1994b) Estimating Functions of Probability Distributions from a Finite Set of Samples, Part 1: Bayes Estimators and the Shannon Entropy. ArXiv:Comp-Gas/9403001.
Wu, Y., & Yang, P. (2014) Minimax rates of entropy estimation on large alphabets via best polynomial approximation. ArXiv:1407.0381 [Cs, Math, Stat].
Zhang, Z., & Grabchak, M. (2014) Nonparametric Estimation of Küllback-Leibler Divergence. Neural Computation, 26(11), 2570–2593. DOI.