Note

**tl;dr** I’m not currently using Transfer Entropy, so I should not be taken as an expert.
But I have dumped some notes here from an email I, as a statistician, was writing to a physicist explaining why I don’t think it’s a meaningful thing to estimate from data “nonparametrically”. That explanation needs to be written, but I never got around to finishing it.
The key point is that graphical models recover the main utility of transfer entropy via other conditional independence tests, handle more general interaction structures than discrete-time multivariate series,
*and* come with a better theory of estimation from data,
so one should just cut to the chase and use the well-established graphical model technology that is out there already.
You can use an information-theoretic dependence test in that framing, if that is important to you for whatever reason.

Transfer Entropy between two random processes attempts to quantify a particular subspecies of *Wiener-causation*.
As Granger summarized it:

> The statement about causality has just two components:
>
> - The cause occurs before the effect; and
> - The cause contains information about the effect that is unique, and is in no other variable.

(In practice this “in no other variable” business is usually quietly ignored in favour of “in no other variable that I have to hand”.)

Transfer entropy is the brainchild of Thomas Schreiber, Peter Grassberger, Andreas Kaiser and others.
It makes a particular assumption about the form of the data (time series)
and the method by which one quantifies dependence:
it is based on the KL-divergence (a kind of information measure) between two different discrete-time stochastic process models.
In the first model, you assume that the two processes are both Markov, but completely independent.
In the second, you assume that the two sequences are jointly Markov.
The transfer entropy is the KL-divergence between the conditional distributions of the *next time step for one sequence* under each model.
Intuitively, it tells us how much predictive power we have lost by assuming that the sequences are independent.
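
To pin the verbal definition down, here is one common way of writing it, in the style of Schr00. The notation is my assumption rather than anything fixed above: $x_t^{(k)}$ is the last $k$ values of the target sequence, $y_t^{(l)}$ is the last $l$ values of the source sequence, and the sum runs over all values of the next state and both histories.

$$
T_{Y \to X} = \sum p\left(x_{t+1}, x_t^{(k)}, y_t^{(l)}\right) \log \frac{p\left(x_{t+1} \mid x_t^{(k)}, y_t^{(l)}\right)}{p\left(x_{t+1} \mid x_t^{(k)}\right)}
$$

Written this way it is the conditional mutual information $I\left(X_{t+1}; Y_t^{(l)} \mid X_t^{(k)}\right)$, which is zero exactly when the next step of $X$ is conditionally independent of the history of $Y$ given its own history.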

One needs to make this concrete by plugging in
specific assumptions on the form of the process.
One such special type of Wiener-causality, *Granger-causality*, is based on ARIMA time series.
Barnett et al. (BaBS09) show that in the special case where you fit a linear autoregressive model with Gaussian noise to your continuous processes, transfer entropy is equivalent to Granger causality (up to a factor of 2).
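
A minimal sketch of that linear-Gaussian special case, assuming an ordinary least-squares autoregressive fit. The function name `gaussian_transfer_entropy`, the simulated coefficients and the choice of one lag are all mine, for illustration only. It compares the residual variance of the target regressed on its own past against the residual variance when the source's past is added; half the log ratio is the in-sample transfer entropy in nats.

```python
import numpy as np


def gaussian_transfer_entropy(source, target, lag=1):
    """In-sample TE (nats), source -> target, under a linear-Gaussian AR(lag) fit."""
    source = np.asarray(source, dtype=float)
    target = np.asarray(target, dtype=float)
    n = len(target)
    future = target[lag:]
    # Column k holds the series delayed by k + 1 steps relative to `future`.
    own_past = np.column_stack(
        [target[lag - k - 1 : n - k - 1] for k in range(lag)]
    )
    src_past = np.column_stack(
        [source[lag - k - 1 : n - k - 1] for k in range(lag)]
    )
    ones = np.ones((n - lag, 1))
    restricted = np.hstack([ones, own_past])        # target's own past only
    full = np.hstack([ones, own_past, src_past])    # plus the source's past

    def residual_variance(design):
        coef, *_ = np.linalg.lstsq(design, future, rcond=None)
        return np.mean((future - design @ coef) ** 2)

    # Half the log variance ratio, i.e. half the Granger statistic.
    return 0.5 * np.log(residual_variance(restricted) / residual_variance(full))


# Hypothetical example: x drives y with a one-step delay, but not vice versa.
rng = np.random.default_rng(0)
n = 5000
x = np.zeros(n)
y = np.zeros(n)
for t in range(1, n):
    x[t] = 0.5 * x[t - 1] + rng.normal()
    y[t] = 0.5 * y[t - 1] + 0.8 * x[t - 1] + rng.normal()

te_xy = gaussian_transfer_entropy(x, y)  # should be clearly positive
te_yx = gaussian_transfer_entropy(y, x)  # should be close to zero
print(te_xy, te_yx)
```

Note that this is an in-sample quantity: because the models are nested, it is never negative, even when the true value is zero, which is a first hint of the estimation problems discussed further down.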

If instead your time series takes values in a finite discrete set, you can just use discrete Markov chains, as in LiPr10. Other models are possible, but I haven’t used any such.
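
For the discrete Markov chain case, here is a minimal sketch of a naive plug-in estimate, assuming a history length of 1 for both sequences. The function name `plugin_transfer_entropy` and the toy data are mine, not taken from LiPr10. It simply replaces each probability in the definition with an empirical frequency.

```python
import math
from collections import Counter

import numpy as np


def plugin_transfer_entropy(source, target):
    """Plug-in TE (bits), source -> target, history length 1, discrete symbols."""
    source = list(source)
    target = list(target)
    # Counts of (target next, target previous, source previous).
    triples = Counter(zip(target[1:], target[:-1], source[:-1]))
    n = sum(triples.values())
    next_prev = Counter()  # (target next, target previous)
    prev_src = Counter()   # (target previous, source previous)
    prev = Counter()       # target previous
    for (nxt, prv, src), count in triples.items():
        next_prev[(nxt, prv)] += count
        prev_src[(prv, src)] += count
        prev[prv] += count
    te = 0.0
    for (nxt, prv, src), count in triples.items():
        with_source = count / prev_src[(prv, src)]          # p(next | prev, src)
        without_source = next_prev[(nxt, prv)] / prev[prv]  # p(next | prev)
        te += (count / n) * math.log2(with_source / without_source)
    return te


# Hypothetical example: y is a copy of a fair coin sequence x, delayed one step.
rng = np.random.default_rng(1)
x = rng.integers(0, 2, size=10_000)
y = np.roll(x, 1)  # y[t] = x[t - 1]

te_xy = plugin_transfer_entropy(x, y)  # close to 1 bit
te_yx = plugin_transfer_entropy(y, x)  # close to 0 bits
print(te_xy, te_yx)
```

The plug-in estimate is biased upwards for finite samples, and the bias gets worse quickly as the alphabet or the history length grows, so I would treat this as a toy rather than as an estimator to rely on.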

## Why do we care about this model of causation?

There is a famous data set from an ancient Santa Fe time series data analysis contest, of ECG and breath data. Transfer entropy has been applied to this to measure whether heart rate *t*-causes breath rate or vice versa.
However, if you really wish to know whether heart rate *t*-causes breath rate or breath rate *t*-causes heart rate,
at least one experiment to work it out has been done many times:
stop either someone’s heart or breath for long enough, and the other one will stop shortly after.
This is invisible from observational data.

Like all Wiener-causation, TE does not measure causal influence *per se* but predictive utility. “*G*-causation” (or *t*-causation?)
is not like intuitive causation.
Specifically, we are often interested *not* only in how well we can predict one from the other,
but in how we can change overall system behaviour by intervening in it.
This is a more complicated question, and a different one, than asking which parts of a system are informative about which others.
See, e.g. causal DAGs.

## Estimating TE of a process from data generated by it

All this is about stochastic processes *for which we know the parameters*, which is actually very artificial. Why do we want to calculate this predictive importance measure for processes that *we already know everything about*?
Blind optimisation of a simulation algorithm perhaps? I guess that’s what CeLZ11 are doing.

More usually, you want to gain insight into some real-world process for which you have the data but imperfect knowledge.

If you have to estimate the transfer entropy between processes with unknown parameters from noisy observations, you have now arrived in the world of statistics.

Having a statistic that we wish to know is one thing. Calculating it given the data rather than a fully specified model is another. How can you estimate it? Which parametric models work? Which nonparametric methods?

Note

**TODO**: mention how to do this.

For now, probably just see RHPK12, who handball the entire thing to
the PC-algorithm for graphical models.
This is probably the best thing to do, since if the question is
“How do I estimate the transfer entropy of a process nonparametrically from observational data?”, the question *should* probably have been
“How do I estimate causality nonparametrically from observational data?”, unless your process comes pre-discretised in time and space, or you don’t like flexibility in your estimands.
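
To give a flavour of the conditional independence tests that graphical-model methods are built from, here is a minimal sketch of a single partial-correlation test with a Fisher z-transform. This is only one linear-Gaussian test, not the PC-algorithm and not the method of RHPK12; the function name `partial_corr_ci_test` and the simulated data are my own assumptions. It asks the transfer-entropy-style question "is the target's next value independent of the source's past, given the target's own past?"

```python
import math

import numpy as np


def partial_corr_ci_test(a, b, cond):
    """Fisher z-test of a independent of b given cond (linear-Gaussian assumption).

    Returns the partial correlation and a two-sided p-value.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    cond = np.asarray(cond, dtype=float).reshape(len(a), -1)
    design = np.hstack([np.ones((len(a), 1)), cond])
    # Regress the conditioning variables out of both a and b.
    res_a = a - design @ np.linalg.lstsq(design, a, rcond=None)[0]
    res_b = b - design @ np.linalg.lstsq(design, b, rcond=None)[0]
    r = np.corrcoef(res_a, res_b)[0, 1]
    z = math.atanh(r) * math.sqrt(len(a) - cond.shape[1] - 3)
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return r, p_value


# Hypothetical example: u drives v with a one-step delay, but not vice versa.
rng = np.random.default_rng(2)
n = 5000
u = np.zeros(n)
v = np.zeros(n)
for t in range(1, n):
    u[t] = 0.5 * u[t - 1] + rng.normal()
    v[t] = 0.5 * v[t - 1] + 0.8 * u[t - 1] + rng.normal()

# Is v[t] independent of u[t-1] given v[t-1]? (It should not be.)
r_uv, p_uv = partial_corr_ci_test(v[1:], u[:-1], v[:-1])
# Is u[t] independent of v[t-1] given u[t-1]? (It should be.)
r_vu, p_vu = partial_corr_ci_test(u[1:], v[:-1], u[:-1])
print(r_uv, p_uv, r_vu, p_vu)
```

Swapping the partial correlation for an information-theoretic dependence measure gives the same question in transfer-entropy clothing, but the surrounding machinery of choosing conditioning sets and testing is the graphical-model part.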

## Refs

- BaBS09
- Barnett, L., Barrett, A. B., & Seth, A. K. (2009) Granger Causality and Transfer Entropy Are Equivalent for Gaussian Variables.
*Physical Review Letters*, 103(23), 238701.
- CeLZ11
- Ceguerra, R. V., Lizier, J. T., & Zomaya, A. Y. (2011) Information storage and transfer in the synchronization process in locally-connected networks. Presented at the IEEE Symposium Series in Computational Intelligence (SSCI 2011) - IEEE Symposium on Artificial Life.
- GrSS91
- Grassberger, P., Schreiber, T., & Schaffrath, C. (1991) Nonlinear time sequence analysis.
*International Journal of Bifurcation and Chaos*, 1(3), 521–547.
- KaSc04
- Kantz, H., & Schreiber, T. (2004) Nonlinear time series analysis (2nd ed.). Cambridge, UK; New York: Cambridge University Press.
- LiPr10
- Lizier, J. T., & Prokopenko, M. (2010) Differentiating information transfer and causal effect.
*The European Physical Journal B - Condensed Matter and Complex Systems*, 73(4), 605–615.
- LiPZ08
- Lizier, J. T., Prokopenko, M., & Zomaya, A. Y. (2008) Local information transfer as a spatiotemporal filter for complex systems.
*Physical Review E*, 77, 026110.
- RHPK12
- Runge, J., Heitzig, J., Petoukhov, V., & Kurths, J. (2012) Escaping the Curse of Dimensionality in Estimating Multivariate Transfer Entropy.
*Physical Review Letters*, 108(25).
- Schr00
- Schreiber, T. (2000) Measuring information transfer.
*Physical Review Letters*, 85(2), 461–464.