Not: what you hope to get from the newspaper. Rather: different types of (formally defined) entropy/information and their disambiguation. The seductive power of the logarithm, and of convex functions rather like it.
A proven path to publication is to find or reinvent a derived measure based on Shannon information, and apply it to something provocative-sounding. (Qualia! Stock markets! Evolution! Language! The qualia of evolving stock market languages!)
This is purely about the analytic definition given random variables. If you wish to estimate such a quantity empirically, from your experiment, that's a different problem.
Connected also to functional equations and yes, statistical mechanics, and quantum information physics.
Shannon Information
Vanilla information, thanks be to Claude Shannon. You are given a discrete random process of specified parameters. How much can you compress it down to a more parsimonious process? (Leaving coding theory aside for the moment.)
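Given a random variable $X$ taking values $x$ from some discrete alphabet $\mathcal{X}$, with probability mass function $p(x)$, the entropy is
$$H(X) := -\sum_{x \in \mathcal{X}} p(x) \log p(x) = E\left[\log \frac{1}{p(X)}\right].$$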
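A minimal sketch of the discrete entropy computation, using only the Python standard library. The function name `shannon_entropy` and the example distributions are illustrative assumptions, not anything from a particular package; the convention $0 \log 0 = 0$ is assumed.

```python
import math


def shannon_entropy(pmf, base=2.0):
    """Entropy of a discrete distribution given as a sequence of probabilities."""
    if abs(sum(pmf) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    # Zero-probability outcomes are skipped: by convention 0 log 0 = 0.
    return -sum(p * math.log(p, base) for p in pmf if p > 0)


fair_coin = shannon_entropy([0.5, 0.5])      # maximal for two outcomes
biased_coin = shannon_entropy([0.9, 0.1])    # more predictable, so lower
uniform_8 = shannon_entropy([1 / 8] * 8)     # uniform over 8 symbols
```

With base 2 the result is in bits, which is the "how far can I compress this" reading: roughly the average number of binary digits needed per symbol.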
Over at the Functional equations page I note that Tom Leinster has a clever proof of the optimality of Shannon information via functional equations.
One interesting aspect of the proof is where the difficulty lies. Let $(I: \Delta_n \to \mathbb{R}^+)_{n \geq 1}$ be continuous functions satisfying the chain rule; we have to show that $I$ is proportional to $H$. All the effort and ingenuity goes into showing that $I$ is proportional to $H$ when restricted to the uniform distributions. In other words, the hard part is to show that there exists a constant $c$ such that
$$I(1/n, \ldots, 1/n) = c H(1/n, \ldots, 1/n)$$
for all $n \geq 1$.
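To make the two ingredients of that argument concrete, here is a small numerical check. It assumes that "the chain rule" means the usual grouping identity: the entropy of a two-stage experiment equals the entropy of the first stage plus the expected entropy of the second. The helper names `entropy` and `compose` are made up for this illustration.

```python
import math


def entropy(pmf):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in pmf if p > 0)


def compose(w, branches):
    """Two-stage distribution: pick branch i with probability w[i],
    then an outcome j within it with probability branches[i][j]."""
    return [w_i * p for w_i, branch in zip(w, branches) for p in branch]


w = [0.5, 0.3, 0.2]
branches = [[0.5, 0.5], [0.1, 0.9], [1.0]]

# Chain rule: H(composite) = H(w) + sum_i w_i H(branch_i)
lhs = entropy(compose(w, branches))
rhs = entropy(w) + sum(w_i * entropy(b) for w_i, b in zip(w, branches))

# On uniform distributions the entropy is log n.
uniform_values = {n: entropy([1 / n] * n) for n in (2, 5, 10)}
```

This only verifies that Shannon entropy satisfies the chain rule; the uniqueness direction is the content of the proof and is not something a numerical check can establish.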
Venkatesan Guruswami, Atri Rudra and Madhu Sudan, Essential Coding Theory.
KL divergence
Because “Kullback-Leibler divergence” is a lot of syllables for something you use so often, even if usually in sentences like “unlike the KL divergences”. Or you could call it the “relative entropy”, but that sounds like something to do with my uncle after the seventh round of Christmas drinks.
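It is defined between the probability mass functions of two discrete random variables, $X$ and $Y$, over the same alphabet, where those probability mass functions are given by $p(x)$ and $q(x)$ respectively:
$$D(P \| Q) := \sum_{x} p(x) \log \frac{p(x)}{q(x)}.$$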
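A sketch of the discrete KL divergence in plain Python, mainly to show that it is not symmetric and that it blows up when the second distribution assigns zero mass to something the first can produce. The name `kl_divergence` and the two example pmfs are assumptions made for illustration.

```python
import math


def kl_divergence(p, q):
    """D(P || Q) in nats for two pmfs listed over the same finite alphabet."""
    if len(p) != len(q):
        raise ValueError("p and q must be defined on the same alphabet")
    total = 0.0
    for p_x, q_x in zip(p, q):
        if p_x == 0:
            continue  # 0 log(0 / q) = 0 by convention
        if q_x == 0:
            return math.inf  # P puts mass where Q has none
        total += p_x * math.log(p_x / q_x)
    return total


p = [0.5, 0.5]
q = [0.9, 0.1]
d_pq = kl_divergence(p, q)
d_qp = kl_divergence(q, p)
```

Swapping the arguments gives a different number, which is why it is a divergence rather than a distance.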
Mutual information
The “informativeness” of one variable given another… Most simply, the KL divergence between the product distribution and the joint distribution of two random variables. (That is, it vanishes if the two variables are independent).
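Now, take $X$ and $Y$ with joint probability mass distribution $p_{XY}(x,y)$ and marginal distributions $p_X(x)$ and $p_Y(y)$.
Then the mutual information is given by
$$I(X;Y) := \sum_{x,y} p_{XY}(x,y) \log \frac{p_{XY}(x,y)}{p_X(x)\,p_Y(y)}.$$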
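A sketch of the mutual information of two discrete variables computed from a joint probability table, as the KL divergence between the joint distribution and the product of its marginals. The function name `mutual_information` and the two toy tables are illustrative assumptions.

```python
import math


def mutual_information(joint):
    """I(X;Y) in nats, where joint[i][j] = P(X = i, Y = j)."""
    p_x = [sum(row) for row in joint]
    p_y = [sum(col) for col in zip(*joint)]
    return sum(
        p_xy * math.log(p_xy / (p_x[i] * p_y[j]))
        for i, row in enumerate(joint)
        for j, p_xy in enumerate(row)
        if p_xy > 0  # zero cells contribute nothing
    )


# Outer product of (0.4, 0.6) and (0.3, 0.7): X and Y independent.
independent = [[0.12, 0.28], [0.18, 0.42]]
# Y is an exact copy of a fair coin X.
copied = [[0.5, 0.0], [0.0, 0.5]]

mi_independent = mutual_information(independent)
mi_copied = mutual_information(copied)
```

The independent table gives zero up to floating-point error, and the copied coin gives $\log 2$ nats, i.e. one full bit shared.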
Estimating this one has been giving me grief lately, so I'll be happy when I get to this section and solve it forever. See nonparametric mutual information.
Getting an intuition of what this measure does is handy, so I'll expound some equivalent definitions that emphasise different characteristics:
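Among the standard equivalent forms are $I(X;Y) = D(p_{XY} \| p_X p_Y)$, $I(X;Y) = H(X) + H(Y) - H(X,Y)$ and $I(X;Y) = H(X) - H(X \mid Y)$. The following sketch checks numerically that they agree on one arbitrary joint table; the table and all variable names are assumptions chosen for illustration.

```python
import math


def entropy(pmf):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in pmf if p > 0)


joint = [[0.3, 0.1], [0.2, 0.4]]  # joint[i][j] = P(X = i, Y = j)
p_x = [sum(row) for row in joint]
p_y = [sum(col) for col in zip(*joint)]

# (1) Divergence of the joint from the product of marginals.
mi_kl = sum(
    p * math.log(p / (p_x[i] * p_y[j]))
    for i, row in enumerate(joint)
    for j, p in enumerate(row)
    if p > 0
)

# (2) Overlap of the marginal entropies.
mi_overlap = entropy(p_x) + entropy(p_y) - entropy([p for row in joint for p in row])

# (3) Reduction in uncertainty about X once Y is known.
h_x_given_y = sum(
    p_y[j] * entropy([joint[i][j] / p_y[j] for i in range(len(joint))])
    for j in range(len(p_y))
)
mi_reduction = entropy(p_x) - h_x_given_y
```

Form (1) emphasises distance from independence, form (2) the symmetry in $X$ and $Y$, and form (3) the reading as information gained about one variable by observing the other.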
Kolmogorov-Sinai entropy
Schreiber says:
If $I$ is obtained by coarse graining a continuous system $X$ at resolution $\epsilon$, the entropy $H_X(\epsilon)$ and entropy rate $h_X(\epsilon)$ will depend on the partitioning and in general diverge like $\log(\epsilon)$ when $\epsilon \to 0$. However, for the special case of a deterministic dynamical system, $\lim_{\epsilon \to 0} h_X(\epsilon) = h_{KS}$ may exist and is then called the Kolmogorov-Sinai entropy. (For non-Markov systems, also the limit $k \to \infty$ needs to be taken.)
That is, it is a special case of the entropy rate for a dynamical system. Cue connection to algorithmic complexity. Also metric entropy?
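The divergence under coarse graining can be seen in a toy static example (entropy only, not the entropy rate or the Kolmogorov-Sinai limit itself). The sketch bins a continuous variable with density $2x$ on $[0,1)$ at resolution $\epsilon = 1/k$; the binned entropy grows like $\log(1/\epsilon)$, and the leftover offset approaches the differential entropy $1/2 - \log 2$. The density, the function name `coarse_grained_entropy` and the bin counts are all assumptions picked for illustration.

```python
import math


def coarse_grained_entropy(cdf, n_bins):
    """Entropy (nats) of a [0, 1)-valued variable binned at resolution 1/n_bins."""
    probs = [cdf((i + 1) / n_bins) - cdf(i / n_bins) for i in range(n_bins)]
    return -sum(p * math.log(p) for p in probs if p > 0)


def triangular_cdf(x):
    """CDF of the density f(x) = 2x on [0, 1)."""
    return x * x


differential_entropy = 0.5 - math.log(2.0)  # -integral of 2x log(2x) over [0, 1)

entropies = {k: coarse_grained_entropy(triangular_cdf, k) for k in (10, 100, 1000)}
# Subtracting log(1/epsilon) = log(k) leaves a quantity that converges.
offsets = {k: h - math.log(k) for k, h in entropies.items()}
```

Each tenfold refinement adds about $\log 10$ nats, which is the partition dependence the quote warns about.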
Alternative formulations and relatives
Rényi Information
Also, the Hartley measure.
You don't need to use a logarithm in your information summation. Free energy, something something. (?)
The observation here is that many of the attractive features of information measures are simply due to the concavity of the logarithm term in the function. So, why not whack another concave function with even more handy features in there? Bam, you are now working on Rényi information. How do you feel?
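For concreteness, a sketch using the usual textbook form of the Rényi entropy of order $\alpha$, $H_\alpha = \frac{1}{1-\alpha} \log \sum_x p(x)^\alpha$, which recovers the Hartley measure at $\alpha = 0$ and Shannon entropy in the limit $\alpha \to 1$. The function names and the example pmf are illustrative assumptions.

```python
import math


def shannon_entropy(pmf):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in pmf if p > 0)


def renyi_entropy(pmf, alpha):
    """Rényi entropy of order alpha >= 0, in nats."""
    if alpha < 0:
        raise ValueError("alpha must be non-negative")
    if math.isclose(alpha, 1.0):
        return shannon_entropy(pmf)  # the alpha -> 1 limit
    return math.log(sum(p ** alpha for p in pmf if p > 0)) / (1.0 - alpha)


pmf = [0.5, 0.25, 0.125, 0.125]
hartley = renyi_entropy(pmf, 0.0)               # log of the support size
near_shannon = renyi_entropy(pmf, 1.0 + 1e-6)   # numerically close to Shannon
collision = renyi_entropy(pmf, 2.0)             # -log of the collision probability
```

The order $\alpha$ acts as a dial: small values count every possible outcome almost equally, large values attend mostly to the most probable ones, and the entropy is non-increasing in $\alpha$.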
Tsallis statistics
Attempting to make information measures “nonextensive”. “q-entropy”. Seems to have made a big splash in Brazil, but less in other countries. Nonextensive measures are an intriguing idea, though. I wonder if it's parochialism that keeps everyone off Tsallis statistics, or a lack of demonstrated usefulness?
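To see what "nonextensive" means in practice, here is a sketch with the usual form of the Tsallis entropy, $S_q = \frac{1 - \sum_x p(x)^q}{q - 1}$. For two independent systems it is not additive; it satisfies the pseudo-additivity rule $S_q(A,B) = S_q(A) + S_q(B) + (1-q) S_q(A) S_q(B)$ instead. Function names and example distributions are assumptions for illustration.

```python
import math


def tsallis_entropy(pmf, q):
    """Tsallis q-entropy; reduces to Shannon entropy (nats) as q -> 1."""
    if math.isclose(q, 1.0):
        return -sum(p * math.log(p) for p in pmf if p > 0)
    return (1.0 - sum(p ** q for p in pmf if p > 0)) / (q - 1.0)


def product_distribution(a, b):
    """Joint pmf of two independent variables, flattened to a list."""
    return [p_a * p_b for p_a in a for p_b in b]


a = [0.7, 0.3]
b = [0.2, 0.5, 0.3]
q = 2.0

s_a = tsallis_entropy(a, q)
s_b = tsallis_entropy(b, q)
s_ab = tsallis_entropy(product_distribution(a, b), q)
pseudo_additive = s_a + s_b + (1.0 - q) * s_a * s_b
```

For $q > 1$ the joint entropy of independent parts comes out smaller than the sum of the parts, whereas Shannon entropy would add exactly.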
Fisher information
See maximum likelihood and information criteria.
Estimating information
Wait, you don't know the exact parameters of your generating process a priori? You need to estimate it from data.
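A sketch of why this is a separate problem: the naive "plug-in" estimator, which substitutes empirical frequencies into the entropy formula, is biased downwards on small samples. The Miller-Madow correction shown here adds $(m-1)/(2N)$, where $m$ is the number of distinct symbols observed and $N$ the sample size. The function names, alphabet size, sample size and seed are arbitrary assumptions; this is one simple correction, not a recommended estimator.

```python
import math
import random
from collections import Counter


def plugin_entropy(samples):
    """Entropy (nats) of the empirical distribution of the samples."""
    n = len(samples)
    return -sum((c / n) * math.log(c / n) for c in Counter(samples).values())


def miller_madow_entropy(samples):
    """Plug-in estimate plus the first-order Miller-Madow bias correction."""
    observed_symbols = len(set(samples))
    return plugin_entropy(samples) + (observed_symbols - 1) / (2 * len(samples))


rng = random.Random(0)  # fixed seed so the sketch is repeatable
alphabet_size = 16
true_entropy = math.log(alphabet_size)  # the source is uniform

# 200 repeated experiments of 30 draws each from the uniform source.
estimates = [
    plugin_entropy([rng.randrange(alphabet_size) for _ in range(30)])
    for _ in range(200)
]
mean_plugin = sum(estimates) / len(estimates)
```

With 30 draws over 16 symbols the empirical distribution can never be exactly uniform, so every plug-in estimate falls short of the true value.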
To Read

John Baez's A Characterisation of Entropy etc http://johncarlosbaez.wordpress.com/category/information-and-entropy/

David Ellerman's History of the Logical Entropy Formula and From Partition Logic to Information Theory, which he has now written up as Elle17.
Refs
 KeOb94: M. S. Keane, George L. O’Brien (1994) A Bernoulli Factory. ACM Trans. Model. Comput. Simul., 4(2), 213–219. DOI
 Leon08: Nikolai Leonenko (2008) A class of Rényi information estimators for multidimensional densities. The Annals of Statistics, 36(5), 2153–2182. DOI
 Shan48: Claude E Shannon (1948) A mathematical theory of communication. The Bell Syst Tech J, 27, 379–423.
 Kell56: J L Kelly Jr (1956) A new interpretation of information rate. Bell System Technical Journal, 35(3), 917–926.
 Schü15: Thomas Schürmann (2015) A Note on Entropy Estimation. Neural Computation, 27(10), 2097–2106. DOI
 BiWo04: Stefan Bieniawski, David H. Wolpert (2004) Adaptive, distributed control of constrained multiagent systems. In Proceedings of the Third International Joint Conference on Autonomous Agents and Multiagent Systems - Volume 3 (Vol. 4, pp. 1230–1231). IEEE Computer Society
 SlTi00: Noam Slonim, Naftali Tishby (2000) Agglomerative information bottleneck. Advances in Neural Information Processing Systems, 12, 617–623.
 Chai77: Gregory J Chaitin (1977) Algorithmic information theory. IBM Journal of Research and Development.
 GáTV01: Péter Gács, John T. Tromp, Paul M.B. Vitányi (2001) Algorithmic statistics. IEEE Transactions on Information Theory, 47(6), 2443–2463. DOI
 ChLi68: C K Chow, C N Liu (1968) Approximating discrete probability distributions with dependence trees. IEEE Transactions on Information Theory, 14, 462–467. DOI
 SHRK06: Cosma Rohilla Shalizi, Robert Haslinger, Jean-Baptiste Rouquier, Kristina L Klinkner, Cristopher Moore (2006) Automatic Filters for the Detection of Coherent Structure in Spatiotemporal Systems. Physical Review E, 73(3). DOI
 Stud16: Milan Studený (2016) Basic facts concerning supermodular functions. ArXiv:1612.06599 [Math, Stat].
 Shib97: Ritei Shibata (1997) Bootstrap estimate of Kullback-Leibler information for model selection. Statistica Sinica, 7, 375–394.
 WoWT00: David H Wolpert, Kevin R Wheeler, Kagan Tumer (2000) Collective intelligence for control of distributed dynamical systems. EPL (Europhysics Letters), 49, 708. DOI
 BiNT01: William Bialek, Ilya Nemenman, Naftali Tishby (2001) Complexity through nonextensivity. Physica A: Statistical and Theoretical Physics, 302(1–4), 89–99. DOI
 Lang90: Chris G. Langton (1990) Computation at the edge of chaos: Phase transitions and emergent computation. Physica D: Nonlinear Phenomena, 42(1–3), 12–37. DOI
 RaSa12: Maxim Raginsky, Igal Sason (2012) Concentration of Measure Inequalities in Information Theory, Communications and Coding. Foundations and Trends in Communications and Information Theory.
 TuWo04: Kagan Tumer, David H Wolpert (2004) Coordination in Large Collectives Chapter 1.
 WoLa02: David H Wolpert, John W Lawson (2002) Designing agent collectives for systems with Markovian dynamics. (pp. 1066–1073). DOI
 LiPr10: Joseph T Lizier, Mikhail Prokopenko (2010) Differentiating information transfer and causal effect. The European Physical Journal B - Condensed Matter and Complex Systems, 73(4), 605–615. DOI
 Ragi11: M. Raginsky (2011) Directed information and Pearl’s causal calculus. In 2011 49th Annual Allerton Conference on Communication, Control, and Computing (Allerton) (pp. 958–965). DOI
 WeKP13: T. Weissman, Y. H. Kim, H. H. Permuter (2013) Directed Information, Causal Estimation, and Communication in Continuous Time. IEEE Transactions on Information Theory, 59(3), 1271–1287. DOI
 Lin91: Jianhua Lin (1991) Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory, 37(1), 145–151. DOI
 Gran63: Clive W J Granger (1963) Economic processes involving feedback. Information and Control, 6(1), 28–48. DOI
 GaVG15: Shuyang Gao, Greg Ver Steeg, Aram Galstyan (2015) Efficient Estimation of Mutual Information for Strongly Dependent Variables. In Journal of Machine Learning Research (pp. 277–286).
 CoTh06: Thomas M Cover, Joy A Thomas (2006) Elements of Information Theory. Wiley-Interscience
 NeSB01: Ilya Nemenman, Fariel Shafee, William Bialek (2001) Entropy and inference, revisited. In arXiv:physics/0108025.
 SKRB98: Steven P Strong, Roland Koberle, Rob R de Ruyter van Steveninck, William Bialek (1998) Entropy and Information in Neural Spike Trains. Phys. Rev. Lett., 80(1), 197–200. DOI
 Gray91: Robert M Gray (1991) Entropy and Information Theory. New York: Springer-Verlag
 HaSt09: Jean Hausser, Korbinian Strimmer (2009) Entropy Inference and the James-Stein Estimator, with Application to Nonlinear Gene Association Networks. Journal of Machine Learning Research, 10, 1469.
 WoWo94: David R. Wolf, David H. Wolpert (1994) Estimating Functions of Distributions from A Finite Set of Samples, Part 2: Bayes Estimators for Mutual Information, Chi-Squared, Covariance and other Statistics. ArXiv:comp-gas/9403002.
 WoWo94: David H. Wolpert, David R. Wolf (1994) Estimating Functions of Probability Distributions from a Finite Set of Samples, Part 1: Bayes Estimators and the Shannon Entropy. ArXiv:comp-gas/9403001.
 KrSG04: Alexander Kraskov, Harald Stögbauer, Peter Grassberger (2004) Estimating mutual information. Physical Review E, 69, 066138. DOI
 Roul99: Mark S Roulston (1999) Estimating the errors on measured entropy and mutual information. Physica D: Nonlinear Phenomena, 125(3–4), 285–294. DOI
 Pani03: Liam Paninski (2003) Estimation of entropy and mutual information. Neural Computation, 15(6), 1191–1253. DOI
 EaNo98: John Earman, John D Norton (1998) Exorcist XIV: The Wrath of Maxwell’s Demon Part I From Maxwell to Szilard. Studies in History and Philosophy of Modern Physics, 29(4), 435–471. DOI
 EaNo99: John Earman, John D Norton (1999) Exorcist XIV: The Wrath of Maxwell’s Demon Part II From Szilard to Landauer and Beyond. Studies in History and Philosophy of Modern Physics, 30(1), 1–40. DOI
 Gras88: Peter Grassberger (1988) Finite sample corrections to entropy and dimension estimates. Physics Letters A, 128(6–7), 369–373. DOI
 PhTT14: Vu N. Phat, Nguyen T. Thanh, Hieu Trinh (2014) Full-order observer design for nonlinear complex large-scale systems with unknown time-varying delayed interactions. Complexity, n/a-n/a. DOI
 WoWT99: David H Wolpert, Kevin R Wheeler, Kagan Tumer (1999) General principles of learning-based multiagent systems. (pp. 77–83). DOI
 Jayn65: Edwin Thompson Jaynes (1965) Gibbs vs Boltzmann Entropies. American Journal of Physics, 33, 391–398. DOI
 BaBS09: Lionel Barnett, Adam B. Barrett, Anil K. Seth (2009) Granger Causality and Transfer Entropy Are Equivalent for Gaussian Variables. Physical Review Letters, 103(23), 238701. DOI
 Eich01: Michael Eichler (2001) Granger-causality graphs for multivariate time series.
 CaMR05: O Cappé, E Moulines, T Ryden (2005) Inference in hidden Markov models. Springer Verlag
 KKPW14: Kirthevasan Kandasamy, Akshay Krishnamurthy, Barnabas Poczos, Larry Wasserman, James M. Robins (2014) Influence Functions for Machine Learning: Nonparametric Estimators for Entropies, Divergences and Mutual Informations. ArXiv:1411.4342 [Stat].
 Riss07: Jorma Rissanen (2007) Information and complexity in statistical modeling. New York: Springer
 TaTB07: Samuel F Taylor, Naftali Tishby, William Bialek (2007) Information and fitness. Arxiv Preprint ArXiv:0712.4382.
 Calu02: Cristian S Calude (2002) Information and Randomness : An Algorithmic Perspective. Springer
 ShCr02: Cosma Rohilla Shalizi, James P. Crutchfield (2002) Information bottlenecks, causal states, and statistical relevance bases: how to represent relevant information in memoryless transduction. Advances in Complex Systems, 05(01), 91–95. DOI
 Lesk12: Jure Leskovec (2012) Information Diffusion and External Influence in Networks. Eprint ArXiv:1206.1331.
 Amar01: Shunʼichi Amari (2001) Information geometry on hierarchy of probability distributions. IEEE Transactions on Information Theory, 47, 1701–1711. DOI
 HiUl85: Hironori Hirata, Robert E Ulanowicz (1985) Information theoretical analysis of the aggregation and hierarchical structure of ecological networks. Journal of Theoretical Biology, 116(3), 321–341. DOI
 Shal00a: Cosma Rohilla Shalizi (n.d.a) Information Theory
 Akai73: Hirotugu Akaike (1973) Information Theory and an Extension of the Maximum Likelihood Principle. In Proceeding of the Second International Symposium on Information Theory (pp. 199–213). Budapest: Akademiai Kiado
 Pink56: Richard C. Pinkerton (1956) Information theory and melody. Scientific American, 194(2), 77–86. DOI
 Cohe62: Joel E. Cohen (1962) Information theory and music. Behavioral Science, 7(2), 137–163. DOI
 Jayn63: Edwin Thompson Jaynes (1963) Information Theory and Statistical Mechanics. In Statistical Physics (Vol. 3).
 CsSh04: Imre Csiszár, Paul C Shields (2004) Information theory and statistics: a tutorial. Foundations and Trends™ in Communications and Information Theory, 1(4), 417–528. DOI
 Dewa03: Roderick C Dewar (2003) Information theory explanation of the fluctuation theorem, maximum entropy production and self-organized criticality in nonequilibrium stationary states. Journal of Physics A: Mathematical and General, 36, 631–641. DOI
 TiPo11: Naftali Tishby, Daniel Polani (2011) Information theory of decisions and actions. In Perception-Action Cycle (pp. 601–636). Springer
 Wolp06a: David H Wolpert (2006a) Information Theory - The Bridge Connecting Bounded Rational Game Theory and Statistical Physics. In Complex Engineered Systems (pp. 262–290). Springer Berlin Heidelberg
 ChDP11: G Chiribella, G M D’Ariano, P Perinotti (2011) Informational derivation of Quantum Theory. Physical Review A, 84(1), 012311. DOI
 SATB05: Noam Slonim, Gurinder S Atwal, Gašper Tkačik, William Bialek (2005) Information-based clustering. Proceedings of the National Academy of Sciences of the United States of America, 102, 18297–18302. DOI
 XuRa17: Aolin Xu, Maxim Raginsky (2017) Information-theoretic analysis of generalization capability of learning algorithms. In Advances In Neural Information Processing Systems.
 Parr64: William Parry (1964) Intrinsic Markov chains. Transactions of the American Mathematical Society, 112(1), 55–66. DOI
 CoGG89: Thomas M. Cover, Péter Gács, Robert M. Gray (1989) Kolmogorov’s Contributions to Information Theory and Algorithmic Complexity. The Annals of Probability, 17(3), 840–865. DOI
 VeVi04: N.K. Vereshchagin, Paul MB Vitányi (2004) Kolmogorov’s structure functions and model selection. IEEE Transactions on Information Theory, 50(12), 3265–3290. DOI
 PlNo00: Joshua B Plotkin, Martin A Nowak (2000) Language Evolution and Information Theory. Journal of Theoretical Biology, 205, 147–159. DOI
 PaVe08: D.P. Palomar, S. Verdu (2008) Lautum Information. IEEE Transactions on Information Theory, 54(3), 964–975. DOI
 ElFr05: Gal Elidan, Nir Friedman (2005) Learning Hidden Variable Networks: The Information Bottleneck Approach. Journal of Machine Learning Research, 6, 81–127.
 LiPZ08: Joseph T Lizier, Mikhail Prokopenko, Albert Y Zomaya (2008) Local information transfer as a spatiotemporal filter for complex systems. Physical Review E, 77, 026110. DOI
 Mart15: Katalin Marton (2015) Logarithmic Sobolev inequalities in discrete product spaces: a proof by a transportation cost distance. ArXiv:1507.02803 [Math].
 Elle17: David Ellerman (2017, May 22) Logical Information Theory: New Foundations for Information Theory.
 Vitá06: Paul M Vitányi (2006) Meaningful information. IEEE Transactions on Information Theory, 52(10), 4617–4626. DOI
 Schr00: Thomas Schreiber (2000) Measuring information transfer. Physical Review Letters, 85(2), 461–464.
 Croo07: Gavin E Crooks (2007) Measuring Thermodynamic Length. Physical Review Letters, 99(10), 100602. DOI
 Sing85: Nirvikar Singh (1985) Monitoring and Hierarchies: The Marginal Value of Information in a Principal-Agent Model. Journal of Political Economy, 93(3), 599–609.
 MoHe14: Kevin R. Moon, Alfred O. Hero III (2014) Multivariate f-Divergence Estimation With Confidence. In NIPS 2014.
 BaBS10: Adam B Barrett, Lionel Barnett, Anil K Seth (2010) Multivariate Granger causality and generalized variance. Phys. Rev. E, 81(4), 041907. DOI
 FMST01: Nir Friedman, Ori Mosenzon, Noam Slonim, Naftali Tishby (2001) Multivariate information bottleneck. In Proceedings of the Seventeenth conference on Uncertainty in artificial intelligence (pp. 152–161). San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.
 SlFT06: Noam Slonim, Nir Friedman, Naftali Tishby (2006) Multivariate information bottleneck. Neural Computation, 18(8), 1739–1789. DOI
 HaOp97: David Haussler, Manfred Opper (1997) Mutual information, metric entropy and cumulative relative entropy risk. The Annals of Statistics, 25(6), 2451–2492. DOI
 ZhGr14: Zhiyi Zhang, Michael Grabchak (2014) Nonparametric Estimation of Küllback-Leibler Divergence. Neural Computation, 26(11), 2570–2593. DOI
 RyRy10: Daniil Ryabko, Boris Ryabko (2010) Nonparametric Statistical Inference for Ergodic Processes. IEEE Transactions on Information Theory, 56(3), 1430–1435. DOI
 LiVa06: F Liese, I Vajda (2006) On Divergences and Informations in Statistics and Information Theory. IEEE Transactions on Information Theory, 52(10), 4394–4412. DOI
 KuLe51: S Kullback, R A Leibler (1951) On Information and Sufficiency. The Annals of Mathematical Statistics, 22(1), 79–86.
 StVe98: Milan Studený, Jiřina Vejnarová (1998) On multiinformation function as a tool for measuring stochastic dependence. In Learning in graphical models (pp. 261–297). Cambridge, Mass.: MIT Press
 Dahl96: R Dahlhaus (1996) On the Kullback-Leibler information divergence of locally stationary processes. Stochastic Processes and Their Applications, 62(1), 139–168. DOI
 Shal00b: Cosma Rohilla Shalizi (n.d.b) Optimal Prediction
 KrGu09: Andreas Krause, Carlos Guestrin (2009) Optimal value of information in graphical models. J. Artif. Int. Res., 35(1), 557–591.
 FrPo10: Peter I Frazier, Warren B Powell (2010) Paradoxes in Learning and the Marginal Value of Information. Decision Analysis, 7(4), 378–403. DOI
 BiNT06: William Bialek, Ilya Nemenman, Naftali Tishby (2006) Predictability, Complexity, and Learning. Neural Computation, 13(11), 2409–2463. DOI
 ABDG08: N. Ay, N. Bertschinger, R. Der, F. Güttler, E. Olbrich (2008) Predictive information and explorative behavior of autonomous robots. The European Physical Journal B - Condensed Matter and Complex Systems, 63(3), 329–339. DOI
 VeVi10: N.K. Vereshchagin, Paul MB Vitányi (2010) Rate Distortion and Denoising of Individual Data Using Kolmogorov Complexity. IEEE Transactions on Information Theory, 56(7), 3438–3454. DOI
 WeVe12: Claudio Weidmann, Martin Vetterli (2012) Rate Distortion Behavior of Sparse Sources. IEEE Transactions on Information Theory, 58(8), 4969–4992. DOI
 Odum88: Howard T Odum (1988) Self-Organization, Transformity, and Information. Science, 242(4882), 1132–1139.
 Spen02: Michael Spence (2002) Signaling in Retrospect and the Informational Structure of Markets. American Economic Review, 92, 434–459. DOI
 Seth06: James P Sethna (2006) Statistical mechanics: entropy, order parameters, and complexity. Oxford University Press, USA
 FeCr04: David P Feldman, James P Crutchfield (2004) Synchronizing to Periodicity: the Transient Information and Synchronization Time of Periodic Sequences. Advances in Complex Systems, 7(03), 329–355. DOI
 Mayn00: John Maynard Smith (2000) The Concept of Information in Biology. Philosophy of Science, 67(2), 177–194.
 Fris10: Karl Friston (2010) The free-energy principle: a unified brain theory? Nature Reviews Neuroscience, 11(2), 127. DOI
 TiPB00: Naftali Tishby, Fernando C Pereira, William Bialek (2000) The information bottleneck method. ArXiv:Physics/0004057.
 StGa15: Greg Ver Steeg, Aram Galstyan (2015) The Information Sieve. ArXiv:1507.02284 [Cs, Math, Stat].
 Chai02: Gregory J Chaitin (2002) The intelligibility of the universe and the notions of simplicity, complexity and irreducibility.
 Shie98: P C Shields (1998) The interactions between ergodic theory and information theory. IEEE Transactions on Information Theory, 44(6), 2079–2093. DOI
 LeAW07: V. Lecomte, C. Appert-Rolland, F. van Wijland (2007) Thermodynamic Formalism for Systems with Markov Dynamics. Journal of Statistical Physics, 127(1), 51–106. DOI
 Smit08a: D Eric Smith (2008a) Thermodynamics of natural selection I: Energy flow and the limits on organization. Journal of Theoretical Biology, 252, 185–197. DOI
 Smit08b: D Eric Smith (2008b) Thermodynamics of natural selection II: Chemical Carnot cycles. Journal of Theoretical Biology, 252, 198–212. DOI
 Smit08c: D Eric Smith (2008c) Thermodynamics of natural selection III: Landauer’s principle in computation and chemistry. Journal of Theoretical Biology, 252(2), 213–220. DOI
 Kolm68: A N Kolmogorov (1968) Three approaches to the quantitative definition of information. International Journal of Computer Mathematics, 2(1), 157–168.
 Klir06: George J Klir (2006) Uncertainty and information. Wiley Online Library
 Wolp06b: David H Wolpert (2006b) What Information Theory says about Bounded Rational Best Response. In The Complex Networks of Economic Interactions (pp. 293–306). Springer
 ShMo03: Cosma Rohilla Shalizi, Cristopher Moore (2003) What Is a Macrostate? Subjective Observations and Objective Dynamics. Eprint ArXiv:cond-mat/0303625.