# UNSW Stats reading group 2016 - Causal DAGs

### An introduction to conditional independence DAGs and their use for causal data.

Note

These are the notes from a reading group I led in 2016 on causal DAGs. When I have time to expand these notes into complete sentences, I will migrate the good bits to an expanded and improved notebook on causal DAGs.

We will follow Pearl’s summary (Pear09b). (approx sections 1-3 of the Pearl paper.)

In particular, I want to get to the identification of causal effects from observational data with unobserved covariates, given an existing causal DAG, via criteria such as the back-door criterion. We’ll see.

Approach: casual, motivate Pearl’s pronouncements, without deriving everything from axioms. Not statistical; will not answer the question of how we infer graph structure from data. Will skip many complexities by taking several slightly over-restrictive conditions, which we would relax if we were not doing this in 1 hour.

Not covered: UGs, PDAGs…

Assumptions: No-one here is an expert in this DAG graphical formalism for causal inference.

## Motivational examples

• Wet pavements
• Obesity contagion
• Nobel prizes and chocolate
• etc

## Machinery

We are interested in representing influence between variables in a non-parametric fashion.

Our main tool to do this will be conditional independence DAGs, and causal use of these. Alternative name: “Bayesian Belief networks”. (Overloads “Bayesian”, so not used here)

### DAGs

DAG: Directed acyclic graph, used here as a directed (probabilistic) graphical model. The graph is defined, as usual, by a set of vertices and edges.

\begin{equation*} \mathcal{G}=(\mathbf{V},E) \end{equation*}

We show the directions of edges by writing them as arrows.

For nodes $X,Y\in \mathbf{V}$ we write $X\rightarrow Y$ to mean there is a directed edge joining them.

Familiar from, e.g., Structural equation models, hierarchical models, expert systems. General graph theory…

A DAG is a graph with directed edges and no cycles. (You cannot return to the starting node by travelling only forward along the arrows.)

We need some terminology.

Parents
The parents of a node $X$ in a graph are all nodes joined to it by an incoming arrow, $\operatorname{parents}(X)=\{Y\in \mathbf{V}:Y\rightarrow X\}.$
Children
Similarly, $\operatorname{children}(X)=\{Y\in \mathbf{V}:X\rightarrow Y\}.$
Co-parent
$\operatorname{coparents}(X)=\{Y\in \mathbf{V}:\exists Z\in \mathbf{V} \text{ s.t. } X\rightarrow Z\text{ and }Y\rightarrow Z\}.$

Ancestors and descendants should be clear as well. For convenience, we define $X\in\operatorname{descendants}(X)$.
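The terminology above can be sketched in a few lines of Python. The edge list is an assumption: it is one plausible structure for the wet-pavement example, not a graph taken from these notes, and the helper names are hypothetical. Two conventions are choices of this sketch: `coparents` excludes the node itself, and `descendants` includes it.

```python
# Assumed wet-pavement DAG, stored as a set of (tail, head) arrows.
EDGES = {
    ("Season", "Sprinkler"),
    ("Season", "Rain"),
    ("Sprinkler", "Wet"),
    ("Rain", "Wet"),
    ("Wet", "Slippery"),
}


def parents(edges, x):
    """Nodes with an arrow into x."""
    return {a for a, b in edges if b == x}


def children(edges, x):
    """Nodes that x has an arrow into."""
    return {b for a, b in edges if a == x}


def coparents(edges, x):
    """Other nodes that share at least one child with x."""
    return {a for c in children(edges, x) for a in parents(edges, c)} - {x}


def descendants(edges, x):
    """Nodes reachable from x along the arrows, counting x itself."""
    seen, stack = {x}, [x]
    while stack:
        for c in children(edges, stack.pop()):
            if c not in seen:
                seen.add(c)
                stack.append(c)
    return seen


print(sorted(parents(EDGES, "Wet")))
print(sorted(coparents(EDGES, "Sprinkler")))
print(sorted(descendants(EDGES, "Sprinkler")))
```

A set of arrows is enough here because every query is a scan over a handful of edges; a larger graph would want adjacency dictionaries.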

### Random variables

I will deal with finite collections of random variables $\mathbf{V}$.

For simplicity of exposition, each of the RVs will be supported on $\mathcal{X}_i\subset\mathbb{Z}$, so that we may work with pmfs, and write $p(X_i|X_j)$ for the pmf. I may write $p(x_i|x_j)$ to mean $p(X_i=x_i|X_j=x_j)$.

Also, we are working with sets of random variables rather than sets of events, and the discrete state space reduces the need to discuss sets of events.

Extension to continuous RVs, or to arbitrary RVs, is trivial for everything I discuss here. (A challenge is if the probabilities are not all strictly positive.)

Motivation in terms of structural models.

\begin{align*} X_6 &= f_6(X_4, X_3, \varepsilon_6) \\ X_5 &= f_5(X_4, X_3, \varepsilon_5) \\ X_4 &= f_4(X_3, X_2, X_1, \varepsilon_4) \\ X_3 &= f_3(\varepsilon_3) \\ X_2 &= f_2(X_1, \varepsilon_2) \\ X_1 &= f_1(\varepsilon_1) \\ \end{align*}

Without further information about the forms of $f_i$ or $\varepsilon_i$, beyond assuming that the noise terms $\varepsilon_i$ are mutually independent, our assumptions have constrained our conditional independence relations to permit a particular factorization of the mass function:

\begin{equation*} p(x_6, x_5, x_4, x_3, x_2, x_1) = p(x_1) p(x_2|x_1) p(x_3) p(x_4|x_1, x_2, x_3) p(x_5|x_3,x_4) p(x_6|x_3,x_4) \end{equation*}

We are “nonparametric” in the sense that working with this conditional factorization does not require any further parametric assumptions on the model.

However, we would like to proceed from this factorization to conditional independence, which is non-trivial. Specifically, we would like to know which variables are conditionally independent of others, given such an (assumed) factorization.

More notation: We write

\begin{equation*} X \perp Y|Z \end{equation*}

for $X$ independent of $Y$ given $Z$.

We also use this notation for sets of random variables, and will bold them when it is necessary to emphasise this.

\begin{equation*} \mathbf{X} \perp \mathbf{Y}|\mathbf{Z} \end{equation*}

Questions:

• $X_2\perp X_3$?
• $X_2\perp X_3|X_1$?
• $X_2\perp X_3|X_4$?
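One brute-force way to probe the questions above is to build a joint pmf with the six-variable factorization and test the independences numerically. This is only a sketch under assumptions that are not in the notes: every variable is binary, the conditional pmfs are drawn at random with a fixed seed, and the helper names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)


def random_cpt(n_parents):
    """Random pmf of a binary node given n_parents binary parents.

    The result has shape (2,) * n_parents + (2,); the last axis is the node.
    """
    t = rng.uniform(0.1, 0.9, size=(2,) * n_parents)
    return np.stack([1.0 - t, t], axis=-1)


# One table per factor: p(x1) p(x2|x1) p(x3) p(x4|x1,x2,x3) p(x5|x3,x4) p(x6|x3,x4)
p1, p2, p3 = random_cpt(0), random_cpt(1), random_cpt(0)
p4, p5, p6 = random_cpt(3), random_cpt(2), random_cpt(2)

# Joint pmf; axis k holds X_{k+1}.
joint = np.einsum("a,ab,c,abcd,cde,cdf->abcdef", p1, p2, p3, p4, p5, p6)


def ci_violation(joint, a, b, cond=()):
    """Largest |p(a,b,c) p(c) - p(a,c) p(b,c)|; zero iff A is independent of B given C."""
    keep = (a, b) + tuple(cond)
    drop = tuple(i for i in range(joint.ndim) if i not in keep)
    p_abc = joint.sum(axis=drop, keepdims=True)
    p_c = p_abc.sum(axis=(a, b), keepdims=True)
    p_ac = p_abc.sum(axis=b, keepdims=True)
    p_bc = p_abc.sum(axis=a, keepdims=True)
    return float(np.abs(p_abc * p_c - p_ac * p_bc).max())


X1, X2, X3, X4 = 0, 1, 2, 3  # zero-based axis numbers
print("X2 vs X3        :", ci_violation(joint, X2, X3))
print("X2 vs X3 | X1   :", ci_violation(joint, X2, X3, (X1,)))
print("X2 vs X3 | X4   :", ci_violation(joint, X2, X3, (X4,)))
```

A value at floating-point zero means the independence holds in this particular distribution. A clearly non-zero value shows that the factorization alone does not force the independence, which is what happens when conditioning on $X_4$, a common child of $X_2$ and $X_3$.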

However, this product notation is not illuminating; we use a graph formalism. That’s where the DAGs come in.

hierarchical model example DAG

This will proceed in 3 steps

• The graphs will describe conditional factorization relations.
• We will do some work to construct from these relations some conditional independence relations, which may be read off the graph.
• From these relations plus a causal interpretation we will derive rules for identification of causal relations
• If we get further than that, it will be all about coffee

Anyway, a joint distribution $p(\mathbf{V})$ decomposes according to a directed graph $\mathcal{G}$ if we may factor it as

\begin{equation*} p(X_1,X_2,\dots,X_v)=\prod_{i=1}^v p(X_i|\operatorname{parents}(X_i)) \end{equation*}

Uniqueness?

It would be tempting to suppose that a node is independent of its children given its parents or somesuch. But things are not quite so simple.

Pavement slipperiness is the ubiquitous example.

Questions:

• $\text{Sprinkler}\perp \text{Rain}$?
• $\text{Sprinkler}\perp \text{Rain}|\text{Wet season}$?
• $\text{Sprinkler}\perp \text{Rain}|\text{Wet pavement}$?
• $\text{Sprinkler}\perp \text{Rain}|\text{Wet season}, \text{Wet pavement}$?

To make precise statements about conditional independence relations we will do more work.

We need new graph vocabulary and conditional independence vocabulary.

Axiomatic characterisation of conditional independence. (Pear08, Laur96)

Theorem: (Pear08) Let $\mathbf{W},\mathbf{X},\mathbf{Y},\mathbf{Z}\subseteq\mathbf{V}$ be disjoint subsets.

Then the relation $\cdot\perp\cdot|\cdot$ satisfies the following relations:

\begin{align*} \mathbf{X} \perp \mathbf{Z} |\mathbf{Y} & \Leftrightarrow & \mathbf{Z}\perp \mathbf{X} | \mathbf{Y} && \text{ Symmetry }&\\ \mathbf{X} \perp \mathbf{Y}\cup \mathbf{W} |\mathbf{Z} & \Rightarrow & \mathbf{X} \perp \mathbf{Y} |\mathbf{Z} \text{ and } \mathbf{X} \perp \mathbf{W} |\mathbf{Z} && \text{ Decomposition }&\\ \mathbf{X} \perp \mathbf{Y}\cup \mathbf{W} |\mathbf{Z} & \Rightarrow & \mathbf{X} \perp \mathbf{Y}|\mathbf{Z}\cup\mathbf{W} && \text{ Weak Union }&\\ \mathbf{X} \perp \mathbf{Y} |\mathbf{Z} \text{ and } \mathbf{X} \perp \mathbf{W}|\mathbf{Z}\cup \mathbf{Y} & \Rightarrow & \mathbf{X} \perp \mathbf{Y}\cup \mathbf{W}|\mathbf{Z} && \text{ Contraction }&\\ \mathbf{X} \perp \mathbf{Y} |\mathbf{Z}\cup \mathbf{W} \text{ and } \mathbf{X} \perp \mathbf{W} |\mathbf{Z}\cup \mathbf{Y} & \Rightarrow & \mathbf{X}\perp \mathbf{W}\cup\mathbf{Y} | \mathbf{Z} && \text{ Intersection } & (*)\\ \end{align*}

(*) The Intersection axiom only holds for strictly positive distributions.
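As a worked illustration of how such properties arise for pmfs, here is a sketch of Decomposition, read as $\mathbf{X} \perp \mathbf{Y}\cup \mathbf{W} |\mathbf{Z} \Rightarrow \mathbf{X} \perp \mathbf{Y}|\mathbf{Z}$. It assumes the discrete setting used in these notes, so that the hypothesis says $p(x,y,w|z)=p(x|z)\,p(y,w|z)$; summing out $w$ gives

\begin{align*} p(x,y|z) &= \sum_{w} p(x,y,w|z) \\ &= \sum_{w} p(x|z)\,p(y,w|z) \\ &= p(x|z)\,p(y|z) \end{align*}

which is the statement $\mathbf{X} \perp \mathbf{Y}|\mathbf{Z}$. Swapping the roles of $\mathbf{Y}$ and $\mathbf{W}$ gives $\mathbf{X} \perp \mathbf{W}|\mathbf{Z}$ in the same way.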

How can we relate this to the topology of the graph?

The flow of conditional information does not correspond exactly to the marginal factorization, but it relates. (mention UG connections?)

B in a fork path

B in a chain path

B in a collider path
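The three patterns can be summarised with their factorizations. The node names $A$, $B$, $C$ are placeholders introduced for this sketch, and "in general" means for a generic choice of conditional pmfs.

\begin{align*} \text{Fork } A\leftarrow B\rightarrow C: &\quad p(a,b,c)=p(b)\,p(a|b)\,p(c|b), &\quad A\perp C|B, \text{ but } A\not\perp C \text{ in general} \\ \text{Chain } A\rightarrow B\rightarrow C: &\quad p(a,b,c)=p(a)\,p(b|a)\,p(c|b), &\quad A\perp C|B, \text{ but } A\not\perp C \text{ in general} \\ \text{Collider } A\rightarrow B\leftarrow C: &\quad p(a,b,c)=p(a)\,p(c)\,p(b|a,c), &\quad A\perp C, \text{ but } A\not\perp C|B \text{ in general} \end{align*}

In the fork and the chain, conditioning on $B$ cuts the flow of information between $A$ and $C$; in the collider, conditioning on $B$ creates it.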

Definition: A set $\mathbf{S}$ blocks a path $\pi$ from $X$ to $Y$ in a DAG $\mathcal{G}$ if either

1. There is a node $a\in\pi$ which is not a collider on $\pi$ such that $a\in\mathbf{S}$, or
2. There is a node $b\in\pi$ which is a collider on $\pi$ and $\operatorname{descendants}(b)\cap\mathbf{S}=\emptyset$

If a path is not blocked, it is active.

Definition: A set $\mathbf{S}$ d-separates two subsets of nodes $\mathbf{X},\mathbf{Y}\subseteq\mathbf{V}$ if it blocks every path between every pair of nodes $(A,B)$ such that $A\in\mathbf{X},\, B\in\mathbf{Y}.$

This looks ghastly and unintuitive, but we have to live with it because it is the shortest path to making simple statements about conditional independence DAGs without horrible circumlocutions, or starting from undirected graphs, which is tedious.
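The two definitions translate fairly directly into code. The following is a minimal sketch, not an efficient algorithm: it enumerates every path in the skeleton and checks the blocking conditions one interior node at a time. The wet-pavement edge list is an assumed structure and the function names are hypothetical; a node counts among its own descendants.

```python
# Assumed wet-pavement DAG, stored as a set of (tail, head) arrows.
EDGES = {
    ("Season", "Sprinkler"),
    ("Season", "Rain"),
    ("Sprinkler", "Wet"),
    ("Rain", "Wet"),
    ("Wet", "Slippery"),
}


def descendants(edges, node):
    """Nodes reachable from node along the arrows, counting node itself."""
    seen, stack = {node}, [node]
    while stack:
        current = stack.pop()
        for a, b in edges:
            if a == current and b not in seen:
                seen.add(b)
                stack.append(b)
    return seen


def simple_paths(edges, start, goal):
    """Yield every simple path from start to goal, ignoring arrow directions."""
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    def extend(path):
        if path[-1] == goal:
            yield path
            return
        for n in neighbours.get(path[-1], ()):
            if n not in path:
                yield from extend(path + [n])

    yield from extend([start])


def is_blocked(edges, path, s):
    """Apply the two blocking conditions to each interior node of the path."""
    for prev, mid, nxt in zip(path, path[1:], path[2:]):
        is_collider = (prev, mid) in edges and (nxt, mid) in edges
        if is_collider:
            if not (descendants(edges, mid) & s):
                return True   # collider with no descendant in s
        elif mid in s:
            return True       # non-collider that lies in s
    return False


def d_separated(edges, x, y, s=()):
    """True if the set s blocks every path between the single nodes x and y."""
    s = set(s)
    return all(is_blocked(edges, p, s) for p in simple_paths(edges, x, y))


for s in [(), ("Season",), ("Wet",), ("Season", "Wet")]:
    print(s, d_separated(EDGES, "Sprinkler", "Rain", s))
```

Path enumeration is exponential in the worst case, so this is only suitable for small teaching graphs.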

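Theorem: (Pear08, Laur96) If the joint distribution of $\mathbf{V}$ factorises according to the DAG $\mathcal{G}$, then for subsets of variables $\mathbf{X},\mathbf{Y},\mathbf{S}$ we have $\mathbf{X}\perp\mathbf{Y}|\mathbf{S}$ whenever $\mathbf{S}$ d-separates $\mathbf{X}$ and $\mathbf{Y}$. Conversely, if $\mathbf{S}$ does not d-separate them, then $\mathbf{X}$ and $\mathbf{Y}$ are dependent given $\mathbf{S}$ in at least one distribution that factorises according to $\mathcal{G}$.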

This puts us in a position to make non-awful, more intuitive statements about the conditional independence relationships that we may read off the DAG.

Corollary: The DAG Markov property.

\begin{equation*} X \perp \operatorname{descendants}(X)^C|\operatorname{parents}(X) \end{equation*}

Corollary: The DAG Markov blanket.

Define

\begin{equation*} \operatorname{blanket}(X):= \operatorname{parents}(X)\cup \operatorname{children}(X)\cup \operatorname{coparents}(X) \end{equation*}

Then

\begin{equation*} X\perp \operatorname{blanket}(X)^C|\operatorname{blanket}(X) \end{equation*}

## Causal interpretation

Finally!

We have a DAG $\mathcal{G}$ and a set of variables $\mathbf{V}$ to which we wish to give a causal interpretation.

Assume

1. The joint distribution of $\mathbf{V}$ factorises according to $\mathcal{G}$
2. $X\rightarrow Y$ means “$X$ causes $Y$” (The Causal Markov property)
3. We additionally assume faithfulness, that is, that $X\leftrightsquigarrow Y$ iff there is a path connecting them.

So, are we done? Only if correlation equals causation.

Correlation tastes as good as causation.

We must also assume that all the relevant variables are included in the graph. (We coyly avoid making this precise.)

[…]Eric Cornell, who won the Nobel Prize in Physics in 2001, told Reuters “I attribute essentially all my success to the very large amount of chocolate that I consume. Personally I feel that milk chocolate makes you stupid… dark chocolate is the way to go. It’s one thing if you want a medicine or chemistry Nobel Prize but if you want a physics Nobel Prize it pretty much has got to be dark chocolate.”

Finally, we need to discuss the relationship between conditional dependence and causal effect. This is the difference between, say,

\begin{equation*} P(\text{Wet pavement}|\text{Sprinkler}=on) \end{equation*}

and

\begin{equation*} P(\text{Wet pavement}|\operatorname{do}(\text{Sprinkler}=on)) \end{equation*}

Called “truncated factorization” in the paper. Do-calculus and graph surgery.

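If we know $P$, this is relatively easy. Remove all influences on the causal variable of interest, that is, drop its conditional distribution given its parents from the factorization and fix it at the intervened value. We show this graphically as wiping out the incoming links.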
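Written out for an intervention on a single variable $X_j$, the truncated factorization reads as follows. The symbol $x_j^{*}$ for the intervened value is notation introduced for this sketch.

\begin{equation*} P(x_1,\dots,x_v|\operatorname{do}(X_j=x_j^{*})) = \begin{cases} \prod_{i\neq j} p(x_i|\operatorname{parents}(x_i)) & \text{if } x_j=x_j^{*} \\ 0 & \text{otherwise} \end{cases} \end{equation*}

Compared with the observational factorization, the only change is that the factor $p(x_j|\operatorname{parents}(x_j))$ is deleted; every other conditional distribution is left untouched.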

Graph surgery to see - does the sprinkler cause the pavement to be wet?
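To make the difference between conditioning and intervening concrete, here is a small numerical sketch of the surgery. Everything specific in it is an assumption made for illustration: the graph structure (Season into Sprinkler and Rain, both of those into Wet), the binary coding, and the made-up probability tables.

```python
import numpy as np

# Assumed CPTs; every variable is binary and index 1 means "yes" or "on".
p_season = np.array([0.5, 0.5])                    # P(wet season)
p_sprinkler = np.array([[0.3, 0.7], [0.9, 0.1]])   # indexed [season, sprinkler]
p_rain = np.array([[0.9, 0.1], [0.2, 0.8]])        # indexed [season, rain]
p_wet = np.array([[[0.99, 0.01], [0.10, 0.90]],
                  [[0.10, 0.90], [0.01, 0.99]]])   # indexed [sprinkler, rain, wet]

# Observational joint p(season, sprinkler, rain, wet) from the DAG factorization.
joint = np.einsum("s,sk,sr,krw->skrw", p_season, p_sprinkler, p_rain, p_wet)

# Conditioning: restrict the joint to sprinkler = on and renormalise.
seen_on = joint[:, 1, :, :]
p_condition = seen_on[:, :, 1].sum() / seen_on.sum()

# Intervening: delete the factor p(sprinkler | season) and fix sprinkler = on.
surgery = np.einsum("s,sr,rw->srw", p_season, p_rain, p_wet[1])
p_do = surgery[:, :, 1].sum()

print(f"P(wet | sprinkler=on)     = {p_condition:.6f}")
print(f"P(wet | do(sprinkler=on)) = {p_do:.6f}")
```

With these made-up numbers the two quantities differ, roughly 0.917 against 0.941. Seeing the sprinkler on is evidence for the dry season and therefore against rain, whereas switching it on ourselves tells us nothing about the season.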

Now suppose we are not given complete knowledge of $P$, but only some of the conditional distributions, because there are unobservable variables. This is the setup of observational studies, epidemiology and so on.

What variables must we know the conditional distributions of in order to know the causal effect? We call a set of covariates $\mathbf{S}$ an admissible set (or sufficient set) with respect to identifying the effect of $X$ on $Y$ iff

\begin{equation*} P(Y=y|\operatorname{do}(X=x))=\sum_{\mathbf{s}} P(Y=y|X=x,\mathbf{S}=\mathbf{s}) P(\mathbf{S}=\mathbf{s}) \end{equation*}

Criterion 1: The parents of a cause are an admissible set (Pear09b)

Criterion 2: The back door criterion.

A set $\mathbf{S}$ satisfies the back-door criterion relative to $(X,Y)$ if

1. $\mathbf{S}\cap\operatorname{descendants}(X)=\emptyset$
2. $\mathbf{S}$ blocks all paths between $X$ and $Y$ which start with an arrow into $X$

This is a sufficient condition for $\mathbf{S}$ to be an admissible set.

Causal properties of sufficient sets:

\begin{equation*} P(Y=y|\operatorname{do}(X=x),S=s)=P(Y=y|X=x,S=s) \end{equation*}

Hence, since intervening on $X$ does not change the distribution of its non-descendants $S$,

\begin{equation*} P(Y=y|\operatorname{do}(X=x))=\sum_sP(Y=y|X=x,S=s)P(S=s) \end{equation*}
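The adjustment formula can be checked numerically on a toy model. The sketch below assumes a wet-pavement graph (Season into Sprinkler and Rain, both of those into Wet) with made-up binary probability tables, and a hypothetical helper `backdoor_adjust` that handles a single adjustment variable. In that assumed graph both {Season} and {Rain} satisfy the back-door criterion for the effect of Sprinkler on Wet, while the empty set does not.

```python
import numpy as np

# Assumed CPTs; every variable is binary and index 1 means "yes" or "on".
p_season = np.array([0.5, 0.5])
p_sprinkler = np.array([[0.3, 0.7], [0.9, 0.1]])   # indexed [season, sprinkler]
p_rain = np.array([[0.9, 0.1], [0.2, 0.8]])        # indexed [season, rain]
p_wet = np.array([[[0.99, 0.01], [0.10, 0.90]],
                  [[0.10, 0.90], [0.01, 0.99]]])   # indexed [sprinkler, rain, wet]

SEASON, SPRINKLER, RAIN, WET = 0, 1, 2, 3          # axis numbers of the joint
joint = np.einsum("s,sk,sr,krw->skrw", p_season, p_sprinkler, p_rain, p_wet)


def backdoor_adjust(joint, x_axis, y_axis, s_axis, x, y):
    """Return sum_s P(Y=y | X=x, S=s) P(S=s) for one adjustment variable S."""
    idx = "abcdefgh"[:joint.ndim]
    # Marginal table p(x, y, s), with axes reordered to (X, Y, S).
    p_xys = np.einsum(f"{idx}->{idx[x_axis]}{idx[y_axis]}{idx[s_axis]}", joint)
    p_s = p_xys.sum(axis=(0, 1))                          # P(S=s)
    p_y_given_xs = p_xys[x, y, :] / p_xys[x].sum(axis=0)  # P(Y=y | X=x, S=s)
    return float((p_y_given_xs * p_s).sum())


# Ground truth from the truncated factorization: drop p(sprinkler | season).
truth = float(np.einsum("s,sr,r->", p_season, p_rain, p_wet[1, :, 1]))

adj_season = backdoor_adjust(joint, SPRINKLER, WET, SEASON, 1, 1)
adj_rain = backdoor_adjust(joint, SPRINKLER, WET, RAIN, 1, 1)
naive = float(joint[:, 1, :, 1].sum() / joint[:, 1].sum())  # no adjustment at all

print(f"truncated factorization : {truth:.6f}")
print(f"adjusting for Season    : {adj_season:.6f}")
print(f"adjusting for Rain      : {adj_rain:.6f}")
print(f"no adjustment           : {naive:.6f}")
```

Both admissible sets reproduce the interventional probability, whereas plain conditioning does not, because the back-door path through Season is left open.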

## Examples

$X_i$ d-separates $Y_i(t)$ from $A_{ij}$. Since $X_i$ is latent and unobserved, $Y_i(t) \leftarrow X_i \rightarrow A_{ij}$ is a confounding path from $Y_i(t)$ to $A_{ij}$. Likewise $Y_j(t-1)\leftarrow X_j \rightarrow A_{ij}$ is a confounding path from $Y_j(t-1)$ to $A_{ij}$. Thus, $Y_i(t)$ and $Y_j(t-1)$ are d-connected when conditioning on all the observed (boxed) variables[…]. Hence the direct effect of $Y_j(t-1)$ on $Y_i(t)$ is not identifiable.

## Bonus bits

Equivalently we could do this:

hierarchical model with explicit exogenous noise.

## Refs

ArMS09
Aral, S., Muchnik, L., & Sundararajan, A. (2009) Distinguishing influence-based contagion from homophily-driven diffusion in dynamic networks. Proceedings of the National Academy of Sciences, 106(51), 21544–21549.
ArCS99
Arnold, B. C., Castillo, E., & Sarabia, J. M. (1999) Conditional specification of statistical models. Springer Science & Business Media
AyPo08
Ay, N., & Polani, D. (2008) Information flows in causal networks. Advances in Complex Systems (ACS), 11(1), 17–41.
BaPe16
Bareinboim, E., & Pearl, J. (2016) Causal inference and the data-fusion problem. Proceedings of the National Academy of Sciences, 113(27), 7345–7352.
BaTP14
Bareinboim, E., Tian, J., & Pearl, J. (2014) Recovering from Selection Bias in Causal and Statistical Inference. In AAAI (pp. 2410–2416).
Beal03
Beal, M. J. (2003) Variational algorithms for approximate Bayesian inference. University of London
BLZS15
Bloniarz, A., Liu, H., Zhang, C.-H., Sekhon, J., & Yu, B. (2015) Lasso adjustments of treatment effect estimates in randomized experiments. arXiv:1507.03652 [Math, Stat].
BGKR15
Brodersen, K. H., Gallusser, F., Koehler, J., Remy, N., & Scott, S. L. (2015) Inferring causal impact using Bayesian structural time-series models. The Annals of Applied Statistics, 9(1), 247–274.
Bühl13
Bühlmann, P. (2013) Causal statistical inference in high dimensions. Mathematical Methods of Operations Research, 77(3), 357–370.
BüKM14
Bühlmann, P., Kalisch, M., & Meier, L. (2014) High-Dimensional Statistics with a View Toward Applications in Biology. Annual Review of Statistics and Its Application, 1(1), 255–278.
BPEM00
Bühlmann, P., Peters, J., Ernest, J., & Maathuis, M. (n.d.) Predicting causal effects in high-dimensional settings.
BüRK13
Bühlmann, P., Rütimann, P., & Kalisch, M. (2013) Controlling false positive selections in high-dimensional regression and causal inference. Statistical Methods in Medical Research, 22(5), 466–492.
ChPe12
Chen, B., & Pearl, J. (2012) Regression and causation: A critical examination of econometric textbooks.
ClMH14
Claassen, T., Mooij, J. M., & Heskes, T. (2014) Proof Supplement - Learning Sparse Causal Models is not NP-hard (UAI2013). arXiv:1411.1557 [Stat].
CMKR12
Colombo, D., Maathuis, M. H., Kalisch, M., & Richardson, T. S. (2012) Learning high-dimensional directed acyclic graphs with latent and selection variables. The Annals of Statistics, 40(1), 294–321.
DeWR11
De Luna, X., Waernbaum, I., & Richardson, T. S. (2011) Covariate selection for the nonparametric estimation of an average treatment effect. Biometrika, asr041.
Elwe13
Elwert, F. (2013) Graphical causal models. In Handbook of causal analysis for social research (pp. 245–273). Springer
ErBü14
Ernest, J., & Bühlmann, P. (2014) Marginal integration for fully robust causal inference. arXiv:1405.1868 [Stat].
Gelm10
Gelman, A. (2010) Causality and statistical learning. American Journal of Sociology, 117(3), 955–966.
HiOB05
Hinton, G. E., Osindero, S., & Bao, K. (2005) Learning causally linked markov random fields. In Proceedings of the 10th International Workshop on Artificial Intelligence and Statistics (pp. 128–135). Citeseer
Jord99
Jordan, M. I. (1999) Learning in graphical models. Cambridge, Mass.: MIT Press
JoWe02a
Jordan, M. I., & Weiss, Y. (2002a) Graphical models: Probabilistic inference. The Handbook of Brain Theory and Neural Networks, 490–496.
JoWe02b
Jordan, M. I., & Weiss, Y. (2002b) Probabilistic inference in graphical models. Handbook of Neural Networks and Brain Theory.
KaBü07
Kalisch, M., & Bühlmann, P. (2007) Estimating High-Dimensional Directed Acyclic Graphs with the PC-Algorithm. J. Mach. Learn. Res., 8, 613–636.
Kenn15
Kennedy, E. H. (2015) Semiparametric theory and empirical processes in causal inference. arXiv Preprint arXiv:1510.04740.
KiPe83
Kim, J. H., & Pearl, J. (1983) A computational model for causal and diagnostic reasoning in inference systems. In IJCAI (Vol. 83, pp. 190–193). Citeseer
KoFr09
Koller, D., & Friedman, N. (2009) Probabilistic graphical models: principles and techniques. Cambridge, MA: MIT Press
Laur96
Lauritzen, S. L. (1996) Graphical Models. Clarendon Press
Laur00
Lauritzen, S. L. (2000) Causal inference from graphical models. In Complex stochastic systems (pp. 63–107). CRC Press
LaSp88
Lauritzen, S. L., & Spiegelhalter, D. J. (1988) Local Computations with Probabilities on Graphical Structures and Their Application to Expert Systems. Journal of the Royal Statistical Society. Series B (Methodological), 50(2), 157–224.
MaCo13
Maathuis, M. H., & Colombo, D. (2013) A generalized backdoor criterion. arXiv Preprint arXiv:1307.5636.
MCKB10
Maathuis, M. H., Colombo, D., Kalisch, M., & Bühlmann, P. (2010) Predicting causal effects in large-scale systems from observational data. Nature Methods, 7(4), 247–248.
MaKB09
Maathuis, M. H., Kalisch, M., & Bühlmann, P. (2009) Estimating high-dimensional intervention effects from observational data. The Annals of Statistics, 37(6A), 3133–3164.
MPSM10
Marbach, D., Prill, R. J., Schaffter, T., Mattiussi, C., Floreano, D., & Stolovitzky, G. (2010) Revealing strengths and weaknesses of methods for gene network inference. Proceedings of the National Academy of Sciences, 107(14), 6286–6291.
Mess12
Messerli, F. H. (2012) Chocolate Consumption, Cognitive Function, and Nobel Laureates. New England Journal of Medicine, 367(16), 1562–1564.
MiMo07
Mihalkova, L., & Mooney, R. J. (2007) Bottom-up learning of Markov logic network structure. In Proceedings of the 24th international conference on Machine learning (pp. 625–632). ACM
Mont11
Montanari, A. (2011) Lecture Notes for Stat 375 Inference in Graphical Models.
Murp12
Murphy, K. P. (2012) Machine Learning: A Probabilistic Perspective (1st edition). Cambridge, MA: The MIT Press
NeOt04
Neapolitan, R. E., & others. (2004) Learning bayesian networks. (Vol. 38). Prentice Hall Upper Saddle River
NoNy11
Noel, H., & Nyhan, B. (2011) The “unfriending” problem: The consequences of homophily in friendship retention for causal estimates of social influence. Social Networks, 33(3), 211–218.
Pear82
Pearl, J. (1982) Reverend Bayes on inference engines: a distributed hierarchical approach. In Proceedings of the National Conference on Artificial Intelligence (pp. 133–136).
Pear86
Pearl, J. (1986) Fusion, propagation, and structuring in belief networks. Artificial Intelligence, 29(3), 241–288.
Pear08
Pearl, J. (2008) Probabilistic reasoning in intelligent systems: networks of plausible inference. (Rev. 2. print., 12. [Dr.].). San Francisco, Calif: Kaufmann
Pear09a
Pearl, J. (2009a) Causal inference in statistics: An overview. Statistics Surveys, 3, 96–146.
Pear09b
Pearl, J. (2009b) Causality: Models, Reasoning and Inference. Cambridge University Press
PeBM15
Peters, J., Bühlmann, P., & Meinshausen, N. (2015) Causal inference using invariant prediction: identification and confidence intervals. arXiv:1501.01332 [Stat].
Ragi11
Raginsky, M. (2011) Directed information and Pearl’s causal calculus. In 2011 49th Annual Allerton Conference on Communication, Control, and Computing (Allerton) (pp. 958–965).
RuWa06
Rubin, D. B., & Waterman, R. P. (2006) Estimating the Causal Effects of Marketing Interventions Using Propensity Score Methodology. Statistical Science, 21(2), 206–222.
SaVa13
Sauer, B., & VanderWeele, T. J. (2013) Use of Directed Acyclic Graphs. Agency for Healthcare Research and Quality (US)
ShMc16
Shalizi, C. R., & McFowland III, E. (2016) Controlling for Latent Homophily in Social Networks through Inferring Latent Locations. arXiv:1607.06565 [Physics, Stat].
ShTh11
Shalizi, C. R., & Thomas, A. C. (2011) Homophily and Contagion Are Generically Confounded in Observational Social Network Studies. Sociological Methods & Research, 40(2), 211–239.
ShPe08
Shpitser, I., & Pearl, J. (2008) Complete identification methods for the causal hierarchy. The Journal of Machine Learning Research, 9, 1941–1979.
ShTc14
Shpitser, I., & Tchetgen, E. T. (2014) Causal Inference with a Graphical Hierarchy of Interventions. arXiv:1411.2127 [Stat].
SmEi08
Smith, D. A., & Eisner, J. (2008) Dependency parsing by belief propagation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (pp. 145–156). Association for Computational Linguistics
SpGS01
Spirtes, P., Glymour, C., & Scheines, R. (2001) Causation, Prediction, and Search, Second Edition. The MIT Press
VaBC12
Vansteelandt, S., Bekaert, M., & Claeskens, G. (2012) On model selection and model misspecification in causal inference. Statistical Methods in Medical Research, 21(1), 7–30.
Wrig34
Wright, S. (1934) The Method of Path Coefficients. The Annals of Mathematical Statistics, 5(3), 161–215.
YeFW03
Yedidia, J. S., Freeman, W. T., & Weiss, Y. (2003) Understanding Belief Propagation and Its Generalizations. In G. Lakemeyer & B. Nebel (Eds.), Exploring Artificial Intelligence in the New Millennium (pp. 239–236). Morgan Kaufmann Publishers
ZPJS12
Zhang, K., Peters, J., Janzing, D., & Schölkopf, B. (2012) Kernel-based Conditional Independence Test and Application in Causal Discovery. arXiv:1202.3775 [Cs, Stat].