# Informations

### Entropies and other measures of surprise

Usefulness: 🔧
Novelty: 💡
Uncertainty: 🤪 🤪 🤪
Incompleteness: 🚧 🚧 🚧

TODO: explain this diagram, which I ripped off Wikipedia…

Not: what you hope to get from the newspaper. Rather: different types of (formally defined) entropy/information and their disambiguation. The seductive power of the logarithm, and of convex functions rather like it.

A proven path to publication is to find or reinvent a derived measure based on Shannon information, and apply it to something provocative-sounding. (Qualia! Stock markets! Evolution! Language! The qualia of evolving stock market languages!)

This is purely about the analytic definitions, given the random variables. If you wish to estimate such a quantity empirically from your experiment, that's a different problem.

Connected also to functional equations and yes, statistical mechanics, and quantum information physics.

## Shannon Information

Vanilla information, thanks be to Claude Shannon. You are given a discrete random process with specified parameters. How much can you compress it down to a more parsimonious process? (Leaving coding theory aside for the moment.)

Given a random variable $$X$$ taking values $$x \in \mathcal{X}$$ from some discrete alphabet $$\mathcal{X}$$, with probability mass function $$p(x)$$, the entropy is

$\begin{array}{ccc} H(X) & := & -\sum_{x \in \mathcal{X}} p(x) \log p(x) \\ & \equiv & E( \log 1/p(x) ) \end{array}$
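
To make the definition concrete, here is a minimal sketch in Python (my own throwaway code with invented names, not from any particular library) that computes the entropy of a pmf supplied as an array of probabilities:

```python
import numpy as np

def shannon_entropy(p, base=2):
    """Entropy of a discrete pmf `p`, an array of probabilities summing to 1."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # adopt the convention 0 log 0 = 0
    return -np.sum(p * np.log(p)) / np.log(base)

# A uniform distribution over 4 symbols has entropy log2(4) = 2 bits.
print(shannon_entropy([0.25, 0.25, 0.25, 0.25]))  # ≈ 2.0
```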

More generally, if $$X$$ has law $$P$$ with density $$\frac{\mathrm{d} P}{\mathrm{d} \mu}$$ with respect to some reference measure $$\mu$$ on a Borel space,

$H(X)=-\int_{\mathcal{X}}\log {\frac {\mathrm {d} P}{\mathrm {d} \mu }}\,\mathrm{d} P$
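
As a standard worked example (not specific to this notebook): taking $$\mu$$ to be Lebesgue measure and $$X \sim \mathcal{N}(0, \sigma^2)$$, this yields the differential entropy

$H(X) = \tfrac{1}{2}\log(2\pi e \sigma^2)$

which, unlike the discrete entropy, can be negative when $$\sigma$$ is small.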

Over at the Functional equations page I note that Tom Leinster has a clever proof of the optimality of Shannon information via functional equations.

One interesting aspect of the proof is where the difficulty lies. Let $$I:\Delta_n \to \mathbb{R}^+$$ be a sequence of continuous functions satisfying the chain rule; we have to show that $$I$$ is proportional to $$H$$. All the effort and ingenuity goes into showing that $$I$$ is proportional to $$H$$ when restricted to the uniform distributions. In other words, the hard part is to show that there exists a constant $$c$$ such that

$I(1/n, \ldots, 1/n) = c H(1/n, \ldots, 1/n)$

for all $$n \geq 1$$.

See also Venkatesan Guruswami, Atri Rudra and Madhu Sudan, Essential Coding Theory.

## K-L divergence

Because "Kullback-Leibler divergence" is a lot of syllables for something you use so often, even if usually in sentences like "unlike the K-L divergences". Or you could call it the "relative entropy", but that sounds like something to do with my uncle after the seventh round of Christmas drinks.

It is defined between the probability mass functions of two discrete random variables $$X, Y$$ over the same space, where those probability mass functions are given by $$p(x)$$ and $$q(x)$$ respectively.

$\begin{array}{ccc} D(P \parallel Q) & := & \sum_{x \in \mathcal{X}} p(x) \log \frac{p(x)}{q(x)} \\ & \equiv & E \log \frac{p(x)}{q(x)} \end{array}$
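
A similarly minimal numeric sketch for the discrete case (again my own illustrative code, not a reference implementation), which also makes the asymmetry visible:

```python
import numpy as np

def kl_divergence(p, q):
    """D(P || Q) in nats for discrete pmfs `p`, `q` on the same alphabet.

    Terms with p(x) = 0 contribute nothing; if q(x) = 0 where p(x) > 0,
    the divergence is infinite.
    """
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

print(kl_divergence([0.5, 0.5], [0.9, 0.1]))  # ~0.51 nats
print(kl_divergence([0.9, 0.1], [0.5, 0.5]))  # ~0.37 nats: not symmetric
```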

More generally, if the random variables have laws, respectively $$P$$ and $$Q$$:

$D_{\operatorname{KL}}(P\parallel Q)=\int_{\operatorname{supp} P}\frac{\mathrm{d} P}{\mathrm{d} Q}\log\frac{\mathrm{d} P}{\mathrm{d} Q}\,\mathrm{d} Q=\int_{\operatorname{supp} P}\log\frac{\mathrm{d} P}{\mathrm{d} Q}\,\mathrm{d} P$

## Mutual information

The "informativeness" of one variable given another… Most simply, the K-L divergence between the joint distribution of two random variables and the product of their marginals. (In particular, it vanishes if and only if the two variables are independent.)

Now, take $$X$$ and $$Y$$ with joint probability mass function $$p_{XY}(x,y)$$ and, for clarity, marginal distributions $$p_X$$ and $$p_Y$$.

Then the mutual information $$I$$ is given by

$I(X; Y) = H(X) - H(X|Y)$

Estimating this one has been giving me grief lately, so I'll be happy when I get to this section and solve it forever. See nonparametric mutual information.

Getting an intuition for what this measure does is handy, so I'll expound some equivalent definitions that emphasise different characteristics:

$\begin{array}{ccc} I(X; Y) & := & \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p_{XY}(x, y) \log \frac{p_{XY}(x,y)}{p_X(x)p_Y(y)} \\ & = & D( p_{XY} \parallel p_X p_Y) \\ & = & E \log \frac{p_{XY}(x,y)}{p_X(x)p_Y(y)} \end{array}$
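
To make those equivalences concrete, here is a small illustrative sketch (names and interface mine) that computes $$I(X;Y)$$ from a joint pmf as the divergence between the joint and the product of its marginals:

```python
import numpy as np

def mutual_information(p_xy):
    """I(X;Y) in nats, from a joint pmf given as a 2-D array."""
    p_xy = np.asarray(p_xy, dtype=float)
    p_x = p_xy.sum(axis=1, keepdims=True)  # marginal of X, shape (nx, 1)
    p_y = p_xy.sum(axis=0, keepdims=True)  # marginal of Y, shape (1, ny)
    mask = p_xy > 0
    return np.sum(p_xy[mask] * np.log((p_xy / (p_x * p_y))[mask]))

dependent = np.array([[0.4, 0.1],
                      [0.1, 0.4]])
independent = np.outer([0.5, 0.5], [0.5, 0.5])
print(mutual_information(dependent))    # ~0.19 nats
print(mutual_information(independent))  # 0.0
```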

More usually we want the conditional mutual information:

$I(X;Y|Z)=\int _{\mathcal {Z}}D_{\mathrm {KL} }(P_{(X,Y)|Z}\|P_{X|Z}\otimes P_{Y|Z})dP_{Z}$

## Kolmogorov-Sinai entropy

Schreiber says:

If $$I$$ is obtained by coarse graining a continuous system $$X$$ at resolution $$\epsilon$$, the entropy $$H_X(\epsilon)$$ and entropy rate $$h_X(\epsilon)$$ will depend on the partitioning and in general diverge like $$\log(\epsilon)$$ when $$\epsilon \to 0$$. However, for the special case of a deterministic dynamical system, $$\lim_{\epsilon\to 0} h_X (\epsilon) = h_{KS}$$ may exist and is then called the Kolmogorov-Sinai entropy. (For non-Markov systems, also the limit $$k \to \infty$$ needs to be taken.)

That is, it is a special case of the entropy rate for a dynamical system. Cue connection to algorithmic complexity. Also metric entropy?

## Relatives

Also, the Hartley measure.

You don't need to use a logarithm in your information summation. Free energy, something something. (?)

The observation here is that many of the attractive features of information measures are simply due to the concavity of the logarithm term in the definition. So, why not whack another concave function with even more handy features in there? Bam, you are now working on Rényi information. How do you feel?
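
For reference, the Rényi entropy of order $$\alpha$$ (with $$\alpha \geq 0$$, $$\alpha \neq 1$$) is

$H_\alpha(X) = \frac{1}{1-\alpha} \log \sum_{x \in \mathcal{X}} p(x)^\alpha$

and it recovers the Shannon entropy in the limit $$\alpha \to 1$$.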

### Tsallis statistics

Attempting to make information measures "non-extensive". "q-entropy". Seems to have made a big splash in Brazil, but less in other countries. Non-extensive measures are an intriguing idea, though. I wonder if it's parochialism that keeps everyone off Tsallis statistics, or a lack of demonstrated usefulness?

## Estimating information

Wait, you don't know the exact parameters of your generating process a priori? Then you need to estimate them from data.
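
For flavour, the most naive option is the plug-in estimator: estimate the pmf by empirical frequencies and take its entropy. A throwaway sketch (names invented by me) follows; it is biased downward in small samples, which is roughly where the literature on better-behaved estimators begins (e.g. Paninski 2003):

```python
import numpy as np

def plugin_entropy(samples, base=2):
    """Naive plug-in estimate of H(X) from i.i.d. samples of a discrete variable.

    Biased downward for small samples, since unseen symbols get probability 0.
    """
    _, counts = np.unique(samples, return_counts=True)
    p_hat = counts / counts.sum()
    return -np.sum(p_hat * np.log(p_hat)) / np.log(base)

rng = np.random.default_rng(1)
x = rng.integers(0, 4, size=50)  # uniform over 4 symbols; true entropy is 2 bits
print(plugin_entropy(x))         # typically a little below 2
```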

# Refs

Akaike, Hirotogu. 1973. "Information Theory and an Extension of the Maximum Likelihood Principle." In Proceedings of the Second International Symposium on Information Theory, edited by Petrov and F Caski, 199–213. Budapest: Akademiai Kiado. http://link.springer.com/chapter/10.1007/978-1-4612-1694-0_15.

Amari, Shun'ichi. 2001. "Information Geometry on Hierarchy of Probability Distributions." IEEE Transactions on Information Theory 47: 1701–11. https://doi.org/10.1109/18.930911.

Ay, N., N. Bertschinger, R. Der, F. Güttler, and E. Olbrich. 2008. "Predictive Information and Explorative Behavior of Autonomous Robots." The European Physical Journal B - Condensed Matter and Complex Systems 63 (3): 329–39. https://doi.org/10.1140/epjb/e2008-00175-0.

Barnett, Lionel, Adam B. Barrett, and Anil K. Seth. 2009. "Granger Causality and Transfer Entropy Are Equivalent for Gaussian Variables." Physical Review Letters 103 (23): 238701. https://doi.org/10.1103/PhysRevLett.103.238701.

Barrett, Adam B, Lionel Barnett, and Anil K Seth. 2010. "Multivariate Granger Causality and Generalized Variance." Phys. Rev. E 81 (4): 041907. https://doi.org/10.1103/PhysRevE.81.041907.

Bialek, William, Ilya Nemenman, and Naftali Tishby. 2001. "Complexity Through Nonextensivity." Physica A: Statistical and Theoretical Physics 302 (1-4): 89–99. https://doi.org/10.1016/S0378-4371(01)00444-7.

———. 2006. "Predictability, Complexity, and Learning." Neural Computation 13 (11): 2409–63. https://doi.org/10.1162/089976601753195969.

Bieniawski, Stefan, and David H. Wolpert. 2004. "Adaptive, Distributed Control of Constrained Multi-Agent Systems." In Proceedings of the Third International Joint Conference on Autonomous Agents and Multiagent Systems-Volume 3, 4:1230–1. IEEE Computer Society. https://ti.arc.nasa.gov/m/profile/dhw/papers/7.pdf.

Calude, Cristian S. 2002. Information and Randomness : An Algorithmic Perspective. Springer.

Cappé, O, E Moulines, and T Ryden. 2005. Inference in Hidden Markov Models. Springer Verlag.

Cerf, Nicolas J., and Chris Adami. 1998. "Information Theory of Quantum Entanglement and Measurement." Physica D: Nonlinear Phenomena 120 (1-2): 62–81.

Cerf, Nicolas J., and Christoph Adami. 1997. "Entropic Bell Inequalities." Physical Review A 55 (5): 3371.

Chaitin, Gregory J. 1977. "Algorithmic Information Theory." IBM Journal of Research and Development.

———. 2002. "The Intelligibility of the Universe and the Notions of Simplicity, Complexity and Irreducibility."

Chiribella, G, G M D'Ariano, and P Perinotti. 2011. "Informational Derivation of Quantum Theory." Physical Review A 84 (1): 012311. https://doi.org/10.1103/PhysRevA.84.012311.

Chow, C K, and C N Liu. 1968. "Approximating Discrete Probability Distributions with Dependence Trees." IEEE Transactions on Information Theory 14: 462–67. https://doi.org/10.1109/TIT.1968.1054142.

Cohen, Joel E. 1962. "Information Theory and Music." Behavioral Science 7 (2): 137–63. https://doi.org/10.1002/bs.3830070202.

Cover, Thomas M., Péter Gács, and Robert M. Gray. 1989. "Kolmogorov's Contributions to Information Theory and Algorithmic Complexity." The Annals of Probability 17 (3): 840–65. https://doi.org/10.1214/aop/1176991250.

Cover, Thomas M, and Joy A Thomas. 2006. Elements of Information Theory. Wiley-Interscience.

Crooks, Gavin E. 2007. "Measuring Thermodynamic Length." Physical Review Letters 99 (10): 100602. https://doi.org/10.1103/PhysRevLett.99.100602.

Csiszár, Imre, and Paul C Shields. 2004. "Information Theory and Statistics: A Tutorial." Foundations and Trends™ in Communications and Information Theory 1 (4): 417–528. https://doi.org/10.1561/0100000004.

Dahlhaus, R. 1996. "On the Kullback-Leibler Information Divergence of Locally Stationary Processes." Stochastic Processes and Their Applications 62 (1): 139–68. https://doi.org/10.1016/0304-4149(95)00090-9.

Dewar, Roderick C. 2003. "Information Theory Explanation of the Fluctuation Theorem, Maximum Entropy Production and Self-Organized Criticality in Non-Equilibrium Stationary States." Journal of Physics A: Mathematical and General 36: 631–41. https://doi.org/10.1088/0305-4470/36/3/303.

Earman, John, and John D Norton. 1998. "Exorcist XIV: The Wrath of Maxwell's Demon. Part I. From Maxwell to Szilard." Studies in History and Philosophy of Modern Physics 29 (4): 435–71. https://doi.org/10.1016/S1355-2198(98)00023-9.

———. 1999. "Exorcist XIV: The Wrath of Maxwell's Demon. Part II. From Szilard to Landauer and Beyond." Studies in History and Philosophy of Modern Physics 30 (1): 1–40. https://doi.org/10.1016/S1355-2198(98)00026-4.

Eichler, Michael. 2001. "Granger-Causality Graphs for Multivariate Time Series." http://archiv.ub.uni-heidelberg.de/volltextserver/20749/1/beitrag.64.pdf.

Elidan, Gal, and Nir Friedman. 2005. "Learning Hidden Variable Networks: The Information Bottleneck Approach." Journal of Machine Learning Research 6: 81–127.

Ellerman, David. 2017. "Logical Information Theory: New Foundations for Information Theory." arXiv:1707.04728 [Quant-Ph], May. http://www.ellerman.org/wp-content/uploads/2017/12/New_Foundations4IT_reprint.pdf.

Feldman, David P, and James P Crutchfield. 2004. "Synchronizing to Periodicity: The Transient Information and Synchronization Time of Periodic Sequences." Advances in Complex Systems 7 (03): 329–55. https://doi.org/10.1142/S0219525904000196.

Frazier, Peter I, and Warren B Powell. 2010. "Paradoxes in Learning and the Marginal Value of Information." Decision Analysis 7 (4): 378–403. https://doi.org/10.1287/deca.1100.0190.

Friedman, Nir, Ori Mosenzon, Noam Slonim, and Naftali Tishby. 2001. "Multivariate Information Bottleneck." In Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence, 152–61. UAI'01. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc. http://arxiv.org/abs/1301.2270.

Friston, Karl. 2010. "The Free-Energy Principle: A Unified Brain Theory?" Nature Reviews Neuroscience 11 (2): 127. https://doi.org/10.1038/nrn2787.

Gao, Shuyang, Greg Ver Steeg, and Aram Galstyan. 2015. "Efficient Estimation of Mutual Information for Strongly Dependent Variables." In Journal of Machine Learning Research, 277–86. http://www.jmlr.org/proceedings/papers/v38/gao15.html.

Gács, Péter, John T. Tromp, and Paul M.B. Vitányi. 2001. "Algorithmic Statistics." IEEE Transactions on Information Theory 47 (6): 2443–63. https://doi.org/10.1109/18.945257.

Granger, Clive W J. 1963. "Economic Processes Involving Feedback." Information and Control 6 (1): 28–48. https://doi.org/10.1016/S0019-9958(63)90092-5.

Grassberger, Peter. 1988. "Finite Sample Corrections to Entropy and Dimension Estimates." Physics Letters A 128 (6–7): 369–73. https://doi.org/10.1016/0375-9601(88)90193-4.

Gray, Robert M. 1991. Entropy and Information Theory. New York: Springer-Verlag.

Hausser, Jean, and Korbinian Strimmer. 2009. "Entropy Inference and the James-Stein Estimator, with Application to Nonlinear Gene Association Networks." Journal of Machine Learning Research 10: 1469.

Haussler, David, and Manfred Opper. 1997. "Mutual Information, Metric Entropy and Cumulative Relative Entropy Risk." The Annals of Statistics 25 (6): 2451–92. https://doi.org/10.1214/aos/1030741081.

Hirata, Hironori, and Robert E Ulanowicz. 1985. "Information Theoretical Analysis of the Aggregation and Hierarchical Structure of Ecological Networks." Journal of Theoretical Biology 116 (3): 321–41. https://doi.org/10.1016/S0022-5193(85)80271-X.

Jaynes, Edwin Thompson. 1963. "Information Theory and Statistical Mechanics." In Statistical Physics. Vol. 3. Brandeis University Summer Institute Lectures in Theoretical Physics.

———. 1965. "Gibbs Vs Boltzmann Entropies." American Journal of Physics 33: 391–98. https://doi.org/10.1119/1.1971557.

Kandasamy, Kirthevasan, Akshay Krishnamurthy, Barnabas Poczos, Larry Wasserman, and James M. Robins. 2014. "Influence Functions for Machine Learning: Nonparametric Estimators for Entropies, Divergences and Mutual Informations," November. http://arxiv.org/abs/1411.4342.

Keane, M. S., and George L. O'Brien. 1994. "A Bernoulli Factory." ACM Trans. Model. Comput. Simul. 4 (2): 213–19. https://doi.org/10.1145/175007.175019.

Kelly Jr, J L. 1956. "A New Interpretation of Information Rate." Bell System Technical Journal 35 (3): 917–26.

Klir, George J. 2006. Uncertainty and Information. Wiley Online Library.

Kolmogorov, A N. 1968. "Three Approaches to the Quantitative Definition of Information." International Journal of Computer Mathematics 2 (1): 157–68.

Kraskov, Alexander, Harald Stögbauer, and Peter Grassberger. 2004. "Estimating Mutual Information." Physical Review E 69: 066138. https://doi.org/10.1103/PhysRevE.69.066138.

Krause, Andreas, and Carlos Guestrin. 2009. "Optimal Value of Information in Graphical Models." J. Artif. Int. Res. 35 (1): 557–91.

Kullback, S, and R A Leibler. 1951. "On Information and Sufficiency." The Annals of Mathematical Statistics 22 (1): 79–86.

Langton, Chris G. 1990. "Computation at the Edge of Chaos: Phase Transitions and Emergent Computation." Physica D: Nonlinear Phenomena 42 (1–3): 12–37. https://doi.org/10.1016/0167-2789(90)90064-V.

Lecomte, V., C. Appert-Rolland, and F. van Wijland. 2007. "Thermodynamic Formalism for Systems with Markov Dynamics." Journal of Statistical Physics 127 (1): 51–106. https://doi.org/10.1007/s10955-006-9254-0.

Leonenko, Nikolai. 2008. "A Class of Rényi Information Estimators for Multidimensional Densities." The Annals of Statistics 36 (5): 2153–82. https://doi.org/10.1214/07-AOS539.

Leskovec, Jure. 2012. "Information Diffusion and External Influence in Networks," June. http://arxiv.org/abs/1206.1331.

Liese, F, and I Vajda. 2006. "On Divergences and Informations in Statistics and Information Theory." IEEE Transactions on Information Theory 52 (10): 4394–4412. https://doi.org/10.1109/TIT.2006.881731.

Lin, Jianhua. 1991. "Divergence Measures Based on the Shannon Entropy." IEEE Transactions on Information Theory 37 (1): 145–51. https://doi.org/10.1109/18.61115.

Lizier, Joseph T, and Mikhail Prokopenko. 2010. "Differentiating Information Transfer and Causal Effect." The European Physical Journal B - Condensed Matter and Complex Systems 73 (4): 605–15. https://doi.org/10.1140/epjb/e2010-00034-5.

Lizier, Joseph T, Mikhail Prokopenko, and Albert Y Zomaya. 2008. "Local Information Transfer as a Spatiotemporal Filter for Complex Systems." Physical Review E 77: 026110. https://doi.org/10.1103/PhysRevE.77.026110.

Marton, Katalin. 2015. "Logarithmic Sobolev Inequalities in Discrete Product Spaces: A Proof by a Transportation Cost Distance," July. http://arxiv.org/abs/1507.02803.

Maynard Smith, John. 2000. "The Concept of Information in Biology." Philosophy of Science 67 (2): 177–94.

Moon, Kevin R., and Alfred O. Hero III. 2014. "Multivariate F-Divergence Estimation with Confidence." In NIPS 2014. http://arxiv.org/abs/1411.2045.

Nemenman, Ilya, Fariel Shafee, and William Bialek. 2001. "Entropy and Inference, Revisited." In. http://arxiv.org/abs/physics/0108025.

Odum, Howard T. 1988. "Self-Organization, Transformity, and Information." Science, 1988.

Palomar, D.P., and S. Verdu. 2008. "Lautum Information." IEEE Transactions on Information Theory 54 (3): 964–75. https://doi.org/10.1109/TIT.2007.915715.

Paninski, Liam. 2003. "Estimation of Entropy and Mutual Information." Neural Computation 15 (6): 1191–1253. https://doi.org/10.1162/089976603321780272.

Parry, William. 1964. "Intrinsic Markov Chains." Transactions of the American Mathematical Society 112 (1): 55–66. https://doi.org/10.1090/S0002-9947-1964-0161372-1.

Phat, Vu N., Nguyen T. Thanh, and Hieu Trinh. 2014. "Full-Order Observer Design for Nonlinear Complex Large-Scale Systems with Unknown Time-Varying Delayed Interactions." Complexity, August, n/a–n/a. https://doi.org/10.1002/cplx.21584.

Pinkerton, Richard C. 1956. "Information Theory and Melody." Scientific American 194 (2): 77–86. https://doi.org/10.1038/scientificamerican0256-77.

Plotkin, Joshua B, and Martin A Nowak. 2000. "Language Evolution and Information Theory." Journal of Theoretical Biology 205: 147–59. https://doi.org/10.1006/jtbi.2000.2053.

Raginsky, M. 2011. "Directed Information and Pearl's Causal Calculus." In 2011 49th Annual Allerton Conference on Communication, Control, and Computing (Allerton), 958–65. https://doi.org/10.1109/Allerton.2011.6120270.

Raginsky, Maxim, and Igal Sason. 2012. "Concentration of Measure Inequalities in Information Theory, Communications and Coding." Foundations and Trends in Communications and Information Theory, December. http://arxiv.org/abs/1212.4663.

Rissanen, Jorma. 2007. Information and Complexity in Statistical Modeling. Information Science and Statistics. New York: Springer. http://www.springer.com/mathematics/probability/book/978-0-387-36610-4.

Roulston, Mark S. 1999. "Estimating the Errors on Measured Entropy and Mutual Information." Physica D: Nonlinear Phenomena 125 (3-4): 285–94. https://doi.org/10.1016/S0167-2789(98)00269-3.

Ryabko, Daniil, and Boris Ryabko. 2010. "Nonparametric Statistical Inference for Ergodic Processes." IEEE Transactions on Information Theory 56 (3): 1430–5. https://doi.org/10.1109/TIT.2009.2039169.

Schreiber, Thomas. 2000. "Measuring Information Transfer." Physical Review Letters 85 (2): 461–64.

Schürmann, Thomas. 2015. "A Note on Entropy Estimation." Neural Computation 27 (10): 2097–2106. https://doi.org/10.1162/NECO_a_00775.

Sethna, James P. 2006. Statistical Mechanics: Entropy, Order Parameters, and Complexity. Oxford University Press, USA.

Shalizi, Cosma Rohilla. n.d. "Information Theory."

———. n.d. "Optimal Prediction."

Shalizi, Cosma Rohilla, and James P. Crutchfield. 2002. "Information Bottlenecks, Causal States, and Statistical Relevance Bases: How to Represent Relevant Information in Memoryless Transduction." Advances in Complex Systems 05 (01): 91–95. https://doi.org/10.1142/S0219525902000481.

Shalizi, Cosma Rohilla, Robert Haslinger, Jean-Baptiste Rouquier, Kristina L Klinkner, and Cristopher Moore. 2006. "Automatic Filters for the Detection of Coherent Structure in Spatiotemporal Systems." Physical Review E 73 (3). https://doi.org/10.1103/PhysRevE.73.036104.

Shalizi, Cosma Rohilla, and Cristopher Moore. 2003. "What Is a Macrostate? Subjective Observations and Objective Dynamics." Eprint arXiv:cond-mat/0303625. http://arxiv.org/abs/cond-mat/0303625.

Shannon, Claude E. 1948. "A Mathematical Theory of Communication." The Bell System Technical Journal 27: 379–423.

Shibata, Ritei. 1997. "Bootstrap Estimate of Kullback-Leibler Information for Model Selection." Statistica Sinica 7: 375–94.

Shields, P C. 1998. "The Interactions Between Ergodic Theory and Information Theory." IEEE Transactions on Information Theory 44 (6): 2079–93. https://doi.org/10.1109/18.720532.

Singh, Nirvikar. 1985. "Monitoring and Hierarchies: The Marginal Value of Information in a Principal-Agent Model." Journal of Political Economy 93 (3): 599–609.

Slonim, Noam, Gurinder S Atwal, Gašper Tkačik, and William Bialek. 2005. "Information-Based Clustering." Proceedings of the National Academy of Sciences of the United States of America 102: 18297–18302. https://doi.org/10.1073/pnas.0507432102.

Slonim, Noam, Nir Friedman, and Naftali Tishby. 2006. "Multivariate Information Bottleneck." Neural Computation 18 (8): 1739–89. https://doi.org/10.1162/neco.2006.18.8.1739.

Slonim, Noam, and Naftali Tishby. 2000. "Agglomerative Information Bottleneck." Advances in Neural Information Processing Systems 12: 617–23.

Smith, D Eric. 2008a. "Thermodynamics of Natural Selection I: Energy Flow and the Limits on Organization." Journal of Theoretical Biology 252: 185–97. https://doi.org/10.1016/j.jtbi.2008.02.010.

———. 2008b. "Thermodynamics of Natural Selection II: Chemical Carnot Cycles." Journal of Theoretical Biology 252: 198–212. https://doi.org/10.1016/j.jtbi.2008.02.008.

———. 2008c. "Thermodynamics of Natural Selection III: Landauer's Principle in Computation and Chemistry." Journal of Theoretical Biology 252 (2): 213–20. https://doi.org/10.1016/j.jtbi.2008.02.013.

Spence, Michael. 2002. "Signaling in Retrospect and the Informational Structure of Markets." American Economic Review 92: 434–59. https://doi.org/10.1257/00028280260136200.

Steeg, Greg Ver, and Aram Galstyan. 2015. "The Information Sieve," July. http://arxiv.org/abs/1507.02284.

Strong, Steven P, Roland Koberle, Rob R de Ruyter van Steveninck, and William Bialek. 1998. "Entropy and Information in Neural Spike Trains." Phys. Rev. Lett. 80 (1): 197–200. https://doi.org/10.1103/PhysRevLett.80.197.

Studený, Milan. 2016. "Basic Facts Concerning Supermodular Functions," December. http://arxiv.org/abs/1612.06599.

Studený, Milan, and Jiřina Vejnarová. 1998. "On Multiinformation Function as a Tool for Measuring Stochastic Dependence." In Learning in Graphical Models, 261–97. Cambridge, Mass.: MIT Press.

Taylor, Samuel F, Naftali Tishby, and William Bialek. 2007. "Information and Fitness." Arxiv Preprint arXiv:0712.4382.

Tishby, Naftali, Fernando C Pereira, and William Bialek. 2000. "The Information Bottleneck Method," April. http://arxiv.org/abs/physics/0004057.

Tishby, Naftali, and Daniel Polani. 2011. "Information Theory of Decisions and Actions." In Perception-Action Cycle, 601–36. Springer.

Tumer, Kagan, and David H Wolpert. 2004. "Coordination in Large Collectives- Chapter 1." In.

Vereshchagin, N.K., and Paul MB Vitányi. 2004. "Kolmogorov's Structure Functions and Model Selection." IEEE Transactions on Information Theory 50 (12): 3265–90. https://doi.org/10.1109/TIT.2004.838346.

———. 2010. "Rate Distortion and Denoising of Individual Data Using Kolmogorov Complexity." IEEE Transactions on Information Theory 56 (7): 3438–54. https://doi.org/10.1109/TIT.2010.2048491.

Vitányi, Paul M. 2006. "Meaningful Information." IEEE Transactions on Information Theory 52 (10): 4617–26. https://doi.org/10.1109/TIT.2006.881729.

Weidmann, Claudio, and Martin Vetterli. 2012. "Rate Distortion Behavior of Sparse Sources." IEEE Transactions on Information Theory 58 (8): 4969–92. https://doi.org/10.1109/TIT.2012.2201335.

Weissman, T., Y. H. Kim, and H. H. Permuter. 2013. "Directed Information, Causal Estimation, and Communication in Continuous Time." IEEE Transactions on Information Theory 59 (3): 1271–87. https://doi.org/10.1109/TIT.2012.2227677.

Wolf, David R., and David H. Wolpert. 1994. "Estimating Functions of Distributions from A Finite Set of Samples, Part 2: Bayes Estimators for Mutual Information, Chi-Squared, Covariance and Other Statistics," March. http://arxiv.org/abs/comp-gas/9403002.

Wolpert, David H. 2006a. "Information Theory–the Bridge Connecting Bounded Rational Game Theory and Statistical Physics." In Complex Engineered Systems, 262–90. Understanding Complex Systems. Springer Berlin Heidelberg. http://arxiv.org/abs/cond-mat/0402508.

———. 2006b. "What Information Theory Says About Bounded Rational Best Response." In The Complex Networks of Economic Interactions, 293–306. Lecture Notes in Economics and Mathematical Systems 567. Springer. http://ti.arc.nasa.gov/m/profile/dhw/papers/6.pdf.

Wolpert, David H, and John W Lawson. 2002. "Designing Agent Collectives for Systems with Markovian Dynamics." In, 1066–73. https://doi.org/10.1145/545056.545074.

Wolpert, David H, Kevin R Wheeler, and Kagan Tumer. 1999. "General Principles of Learning-Based Multi-Agent Systems." In, 77–83. https://doi.org/10.1145/301136.301167.

———. 2000. "Collective Intelligence for Control of Distributed Dynamical Systems." EPL (Europhysics Letters) 49: 708. https://doi.org/10.1209/epl/i2000-00208-x.

Wolpert, David H., and David R. Wolf. 1994. "Estimating Functions of Probability Distributions from a Finite Set of Samples, Part 1: Bayes Estimators and the Shannon Entropy," March. http://arxiv.org/abs/comp-gas/9403001.

Xu, Aolin, and Maxim Raginsky. 2017. "Information-Theoretic Analysis of Generalization Capability of Learning Algorithms." In Advances in Neural Information Processing Systems. http://arxiv.org/abs/1705.07809.

Zhang, Zhiyi, and Michael Grabchak. 2014. "Nonparametric Estimation of Kullback-Leibler Divergence." Neural Computation 26 (11): 2570–93. https://doi.org/10.1162/NECO_a_00646.