tl;dr: These are the notes from a reading group I led in 2016 on causal DAGs. When I have time to expand these notes into complete sentences, I will migrate the good bits to an expanded and improved notebook on causal DAGs.
We will follow Pearl's summary (Pear09b) (approximately sections 1–3 of the Pearl paper).
In particular, I want to get to the identification of causal effects from observational data with unobserved covariates, given an existing causal DAG, via criteria such as the back-door criterion. We'll see.
Approach: casual; motivate Pearl's pronouncements without deriving everything from axioms. Not statistical; will not answer the question of how we infer graph structure from data. Will skip many complexities by adopting several slightly over-restrictive conditions, which we would relax if we were not doing this in 1 hour.
Not covered: UGs, PDAGs…
Assumptions: No one here is an expert in this DAG graphical formalism for causal inference.
Motivational examples
 Wet pavements
 Obesity contagion
 Nobel prizes and chocolate
 Simpson's paradox
 etc
Machinery
We are interested in representing influence between variables in a nonparametric fashion.
Our main tool to do this will be conditional independence DAGs, and the causal use of these. An alternative name is “Bayesian belief networks”, but that overloads “Bayesian”, so it is not used here.
DAGs
DAG: Directed (probabilistic) graphical model. A graph is defined, as usual, by a set of vertices and edges.
We show the directions of edges by writing them as arrows.
For nodes $X$ and $Y$ we write $X \rightarrow Y$ to mean there is a directed edge joining them.
Familiar from, e.g., Structural equation models, hierarchical models, expert systems. General graph theory…
A DAG is a graph with directed edges and no cycles (you cannot return to the same starting node travelling only forwards along the arrows).
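As a concrete sketch (in Python; the wet-pavement node names are invented, echoing the motivational example), a DAG can be stored as a map from each node to its set of parents, and acyclicity checked with Kahn's algorithm:

```python
from collections import deque

# A hypothetical DAG as a dict mapping each node to its set of parents.
parents = {
    "sprinkler": set(),
    "rain": set(),
    "wet_pavement": {"sprinkler", "rain"},
    "slippery": {"wet_pavement"},
}

def is_dag(parents):
    """Kahn's algorithm: repeatedly remove nodes all of whose parents
    have already been removed; the graph is acyclic iff every node
    can eventually be removed this way."""
    remaining = {v: set(ps) for v, ps in parents.items()}
    ready = deque(v for v, ps in remaining.items() if not ps)
    removed = set()
    while ready:
        v = ready.popleft()
        removed.add(v)
        for w, ps in remaining.items():
            if v in ps:
                ps.discard(v)
                if not ps and w not in removed and w not in ready:
                    ready.append(w)
    return len(removed) == len(parents)

print(is_dag(parents))                      # True: no directed cycles
print(is_dag({"a": {"b"}, "b": {"a"}}))     # False: a 2-cycle
```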
We need some terminology.
 Parents

The parents of a node in a graph are all nodes joined to it by an incoming arrow.
 Children

similarly, the children of a node are all nodes joined to it by an outgoing arrow.
 Coparent

The coparents of a node are the other parents of its children. Ancestors and descendants should be clear as well. For convenience, we write $\operatorname{pa}(X)$ for the set of parents of a node $X$.
Random variables
I will deal with finite collections of random variables $X_1, X_2, \dots, X_N$.
For simplicity of exposition, each of the RVs will be supported on a countable set, so that we may work with pmfs, and write $p$ for the pmf. I may write $p(x)$ to mean $P(X = x)$.
Also we are working with sets of random variables rather than sets of events and the discrete state space reduces the need to discuss sets of events.
Extension to continuous RVs, or arbitrary RVs, is trivial for everything I discuss here. (A challenge arises if the probabilities are not all strictly positive.)
Motivation in terms of structural models.
Without further information about the forms of the structural equations or the noise distributions, our assumptions have constrained our conditional independence relations to permit a particular factorization of the mass function: $p(x_1, \dots, x_N) = \prod_{i=1}^N p(x_i \mid x_{\operatorname{pa}(i)})$, where $\operatorname{pa}(i)$ denotes the indices of the parents of node $i$.
We are “nonparametric” in the sense that working with this conditional factorization does not require any further parametric assumptions on the model.
However, we would like to proceed from this factorization to conditional independence, which is nontrivial. Specifically, we would like to know which variables are conditionally independent of others, given such an (assumed) factorization.
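To make the question concrete, here is a small numerical sketch (a hypothetical binary chain $A \to B \to C$ with invented conditional tables): given only the factorization, we can verify a conditional independence by brute force over the joint pmf.

```python
from itertools import product

# Hypothetical binary chain A -> B -> C with invented conditional tables.
pA = {0: 0.6, 1: 0.4}
pB_A = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.3, 1: 0.7}}    # pB_A[a][b] = p(b | a)
pC_B = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.25, 1: 0.75}}  # pC_B[b][c] = p(c | b)

# The assumed factorization: p(a, b, c) = p(a) p(b | a) p(c | b).
joint = {(a, b, c): pA[a] * pB_A[a][b] * pC_B[b][c]
         for a, b, c in product((0, 1), repeat=3)}

def marginal(keep):
    """Sum the joint pmf down to the coordinates listed in `keep`."""
    out = {}
    for cell, p in joint.items():
        key = tuple(cell[i] for i in keep)
        out[key] = out.get(key, 0.0) + p
    return out

# A is independent of C given B iff p(a,b,c) p(b) = p(a,b) p(b,c) everywhere.
pB, pAB, pBC = marginal([1]), marginal([0, 1]), marginal([1, 2])
independent = all(
    abs(joint[(a, b, c)] * pB[(b,)] - pAB[(a, b)] * pBC[(b, c)]) < 1e-12
    for a, b, c in product((0, 1), repeat=3))
print(independent)  # True: the chain factorization implies this independence
```

This confirms one independence for one graph, but reading all such relations directly off an arbitrary factorization is exactly the nontrivial part the graph machinery below addresses.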
More notation: we write $X \perp Y \mid Z$ for “$X$ independent of $Y$ given $Z$”.
We also use this notation for sets of random variables, and will bold them when it is necessary to emphasize this.
Questions:
 ?
 ?
 ?
However, this product notation is not illuminating; we use a graph formalism. That's where the DAGs come in.
This will proceed in 3 steps:

1. The graphs will describe conditional factorization relations.

2. We will do some work to construct from these relations some conditional independence relations, which may be read off the graph.

3. From these relations, plus a causal interpretation, we will derive rules for the identification of causal relations.
If we get further than that, it will be all about coffee
Anyway, a joint distribution decomposes according to a directed graph if we may factor it as $p(x_1, \dots, x_N) = \prod_{i=1}^N p(x_i \mid x_{\operatorname{pa}(i)})$.
Uniqueness?
It would be tempting to suppose that a node is independent of its children given its parents, or some such. But things are not quite so simple.
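The classic trouble case is the collider. A quick simulation (hypothetical setup: two independent fair coins and their XOR) shows two marginally independent variables becoming perfectly dependent once we condition on their common child:

```python
import random

random.seed(0)
# Hypothetical collider: independent fair coins X, Y, and child Z = X xor Y.
samples = [(random.random() < 0.5, random.random() < 0.5)
           for _ in range(100_000)]

def p_y_given_x(pairs):
    """Return (P(Y=1 | X=0), P(Y=1 | X=1)) estimated from pairs."""
    p0 = sum(y for x, y in pairs if not x) / sum(1 for x, y in pairs if not x)
    p1 = sum(y for x, y in pairs if x) / sum(1 for x, y in pairs if x)
    return p0, p1

# Marginally, X tells us (almost) nothing about Y:
print(p_y_given_x(samples))      # both estimates close to 0.5

# Condition on the collider Z = X xor Y being true:
stratum = [(x, y) for x, y in samples if x ^ y]
print(p_y_given_x(stratum))      # (1.0, 0.0): Y is now determined by X
```

So conditioning on a common child creates dependence between its parents, which is why no slogan as simple as “independent of children given parents” can be the whole story.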
Questions:
 ?
 ?
 ?
 ?
To make precise statements about conditional independence relations we will do more work.
We need new graph vocabulary and conditional independence vocabulary.
Axiomatic characterisation of conditional independence. (Pear08, Laur96)
Theorem: (Pear08) For disjoint subsets of variables $\mathbf{W}, \mathbf{X}, \mathbf{Y}, \mathbf{Z}$, the relation $\mathbf{X} \perp \mathbf{Y} \mid \mathbf{Z}$ satisfies the following axioms:

 Symmetry: $\mathbf{X} \perp \mathbf{Y} \mid \mathbf{Z} \Rightarrow \mathbf{Y} \perp \mathbf{X} \mid \mathbf{Z}$
 Decomposition: $\mathbf{X} \perp (\mathbf{Y}, \mathbf{W}) \mid \mathbf{Z} \Rightarrow \mathbf{X} \perp \mathbf{Y} \mid \mathbf{Z}$
 Weak union: $\mathbf{X} \perp (\mathbf{Y}, \mathbf{W}) \mid \mathbf{Z} \Rightarrow \mathbf{X} \perp \mathbf{Y} \mid (\mathbf{Z}, \mathbf{W})$
 Contraction: $\mathbf{X} \perp \mathbf{Y} \mid \mathbf{Z}$ and $\mathbf{X} \perp \mathbf{W} \mid (\mathbf{Y}, \mathbf{Z}) \Rightarrow \mathbf{X} \perp (\mathbf{Y}, \mathbf{W}) \mid \mathbf{Z}$
 Intersection (*): $\mathbf{X} \perp \mathbf{Y} \mid (\mathbf{W}, \mathbf{Z})$ and $\mathbf{X} \perp \mathbf{W} \mid (\mathbf{Y}, \mathbf{Z}) \Rightarrow \mathbf{X} \perp (\mathbf{Y}, \mathbf{W}) \mid \mathbf{Z}$

(*) The Intersection axiom only holds for strictly positive distributions.
Judea Pearl (Pear08):
How can we relate this to the topology of the graph?
The flow of conditional information does not correspond exactly to the marginal factorization, but the two are related. (Mention UG connections?)
Definition: A set $S$ blocks a path $\pi$ from $X$ to $Y$ in a DAG if either

there is a node $a$ on $\pi$ which is not a collider on $\pi$ such that $a \in S$, or

there is a node $b$ on $\pi$ which is a collider on $\pi$ (the arrows of $\pi$ meet head-to-head, as in $\rightarrow b \leftarrow$), and neither $b$ nor any descendant of $b$ is in $S$.

If a path is not blocked, it is active.
Definition: A set $S$ d-separates two subsets of nodes $\mathcal{X}$ and $\mathcal{Y}$ if it blocks every path between every pair of nodes $X \in \mathcal{X}$, $Y \in \mathcal{Y}$.
This looks ghastly and unintuitive, but we have to live with it because it is the shortest path to making simple statements about conditional independence DAGs without horrible circumlocutions, or starting from undirected graphs, which is tedious.
Theorem: (Pear08, Laur96) If the joint distribution of the variables factorises according to the DAG $\mathcal{G}$, then for two subsets of variables $\mathcal{X}$ and $\mathcal{Y}$ we have $\mathcal{X} \perp \mathcal{Y} \mid S$ iff $S$ d-separates $\mathcal{X}$ and $\mathcal{Y}$.
This puts us in a position to make non-awful, more intuitive statements about the conditional independence relationships that we may read off the DAG.
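One way to test d-separation mechanically, without tracing paths, is the classical moralized-ancestral-graph reduction: $\mathcal{X}$ and $\mathcal{Y}$ are d-separated by $S$ iff $S$ separates them in the moral (undirected) graph of the ancestral subgraph of $\mathcal{X} \cup \mathcal{Y} \cup S$. A sketch, again using invented wet-pavement node names:

```python
# d-separation via the moralized ancestral graph. Graph given as a
# node -> set-of-parents map; xs, ys, ss assumed pairwise disjoint.
def d_separated(parents, xs, ys, ss):
    # 1. Restrict to the ancestors of xs, ys and ss.
    keep, stack = set(), list(xs | ys | ss)
    while stack:
        v = stack.pop()
        if v not in keep:
            keep.add(v)
            stack.extend(parents[v])
    # 2. Moralize: link each node to its parents, "marry" every pair
    #    of coparents, and drop edge directions.
    adj = {v: set() for v in keep}
    for v in keep:
        for p in parents[v]:
            adj[v].add(p)
            adj[p].add(v)
        for p in parents[v]:
            for q in parents[v]:
                if p != q:
                    adj[p].add(q)
    # 3. Delete ss; the sets are d-separated iff now disconnected.
    seen, stack = set(), list(xs)
    while stack:
        v = stack.pop()
        if v in ys:
            return False
        if v not in seen and v not in ss:
            seen.add(v)
            stack.extend(adj[v])
    return True

g = {"rain": set(), "sprinkler": set(),
     "wet": {"rain", "sprinkler"}, "slippery": {"wet"}}
print(d_separated(g, {"rain"}, {"sprinkler"}, set()))         # True
print(d_separated(g, {"rain"}, {"sprinkler"}, {"wet"}))       # False
print(d_separated(g, {"rain"}, {"sprinkler"}, {"slippery"}))  # False
```

Note that conditioning on `slippery`, a mere descendant of the collider `wet`, already unblocks the path between `rain` and `sprinkler`, exactly as the blocking definition predicts.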
Corollary: The DAG Markov property. A node is conditionally independent of its non-descendants, given its parents.
Corollary: The DAG Markov blanket. Define the Markov blanket of a node to be its parents, its children, and its children's other parents. Then, given its Markov blanket, a node is conditionally independent of every other node in the graph.
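A sketch of reading the Markov blanket off a parents map (the graph here is invented for illustration):

```python
def markov_blanket(parents, v):
    """Markov blanket of v in a DAG (node -> parents map): v's parents,
    its children, and its children's other parents (the coparents)."""
    children = {w for w, ps in parents.items() if v in ps}
    coparents = {p for w in children for p in parents[w]} - {v}
    return parents[v] | children | coparents

# Invented example graph:
g = {"a": set(), "b": set(), "c": {"a", "b"},
     "d": {"c"}, "e": set(), "f": {"d", "e"}}
print(sorted(markov_blanket(g, "d")))  # ['c', 'e', 'f']
print(sorted(markov_blanket(g, "c")))  # ['a', 'b', 'd']
```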
Causal interpretation
Finally!
We have a DAG and a set of variables to which we wish to give a causal interpretation.
Assume
 The joint distribution of the variables factors according to the DAG $\mathcal{G}$
 $X \rightarrow Y$ means “$X$ causes $Y$” (the causal Markov property)
 We additionally assume faithfulness, that is, that variables are dependent iff there is an active path connecting them.
So, are we done? Only if correlation equals causation.
We add the additional condition that
 all the relevant variables are included in the graph. (We coyly avoid making this precise)
The BBC raised one possible confounding variable:
[…]Eric Cornell, who won the Nobel Prize in Physics in 2001, told Reuters “I attribute essentially all my success to the very large amount of chocolate that I consume. Personally I feel that milk chocolate makes you stupid… dark chocolate is the way to go. It's one thing if you want a medicine or chemistry Nobel Prize but if you want a physics Nobel Prize it pretty much has got to be dark chocolate.”
Finally, we need to discuss the relationship between conditional dependence and causal effect. This is the difference between, say, $p(y \mid x)$ and $p(y \mid \operatorname{do}(x))$.
This is called the “truncated factorization” in the paper; see also do-calculus and graph surgery.
If we know the whole joint distribution, this is relatively easy. Marginalize out all influences to the causal variable of interest, which we show graphically as wiping out a link.
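A numerical sketch on an invented three-variable model $Z \to X$, $Z \to Y$, $X \to Y$: graph surgery deletes the factor $p(x \mid z)$ and fixes $x$, so $p(y \mid \operatorname{do}(x)) = \sum_z p(z)\, p(y \mid x, z)$, which differs from the observational conditional $p(y \mid x)$:

```python
# Invented binary model with a confounder: Z -> X, Z -> Y, X -> Y.
pZ = {0: 0.5, 1: 0.5}
pX_Z = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.2, 1: 0.8}}  # pX_Z[z][x] = p(x | z)
pY_XZ = {(x, z): {1: p, 0: 1 - p}                  # pY_XZ[(x, z)][y]
         for (x, z), p in {(0, 0): 0.1, (0, 1): 0.5,
                           (1, 0): 0.5, (1, 1): 0.9}.items()}

# Observational conditional p(y=1 | x=1), from the full factorization:
num = sum(pZ[z] * pX_Z[z][1] * pY_XZ[(1, z)][1] for z in (0, 1))
den = sum(pZ[z] * pX_Z[z][1] for z in (0, 1))
print(round(num / den, 3))   # 0.82

# Truncated factorization: delete the factor p(x | z) and fix x = 1:
# p(y=1 | do(x=1)) = sum_z p(z) p(y=1 | x=1, z)
do = sum(pZ[z] * pY_XZ[(1, z)][1] for z in (0, 1))
print(round(do, 3))          # 0.7
```

The gap between 0.82 and 0.7 is exactly the confounding through $Z$: conditioning on $X=1$ also tells us $Z$ is probably 1, whereas intervening on $X$ does not.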
Now suppose we are not given complete knowledge of the joint distribution, but only some of the conditional distributions (there are unobservable variables). This is the setup of observational studies, epidemiology and so on.
What variables must we know the conditional distributions of in order to know the causal effect? That is, we call a set of covariates $S$ an admissible set (or sufficient set) with respect to identifying the effect of $X$ on $Y$ iff $p(y \mid \operatorname{do}(x)) = \sum_{s} p(y \mid x, s)\, p(s)$.
Criterion 1: The parents of a cause are an admissible set (Pear09b)
Criterion 2: The back-door criterion. A set $S$ such that

no node in $S$ is a descendant of $X$, and

$S$ blocks all paths between $X$ and $Y$ which start with an arrow into $X$.

This is a sufficient condition.
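A simulation sketch (invented structural model in which $Z$ satisfies the back-door criterion for the effect of $X$ on $Y$): adjusting for $Z$ in purely observational samples recovers the interventional quantity we get by actually forcing $X$ in the simulator, while the naive conditional does not:

```python
import random

random.seed(1)
N = 100_000

def draw(do_x=None):
    # Invented structural model: Z -> X, Z -> Y, X -> Y, all binary.
    z = random.random() < 0.5
    x = (random.random() < (0.8 if z else 0.2)) if do_x is None else do_x
    y = random.random() < 0.2 + 0.4 * x + 0.3 * z
    return z, x, y

obs = [draw() for _ in range(N)]

# Naive observational P(Y=1 | X=1): confounded by Z.
treated = [s for s in obs if s[1]]
naive = sum(s[2] for s in treated) / len(treated)

# Back-door adjustment over the admissible set {Z}:
# P(Y=1 | do(X=1)) ~ sum_z Phat(z) Phat(Y=1 | X=1, z)
est = 0.0
for z in (False, True):
    pz = sum(1 for s in obs if s[0] == z) / N
    stratum = [s for s in obs if s[0] == z and s[1]]
    est += pz * sum(s[2] for s in stratum) / len(stratum)

# Ground truth, by actually intervening in the simulator:
truth = sum(draw(do_x=True)[2] for _ in range(N)) / N

print(round(naive, 2), round(est, 2), round(truth, 2))
```

Here the true interventional value is $0.2 + 0.4 + 0.3 \cdot 0.5 = 0.75$; the adjusted estimate lands near it, while the naive conditional sits near 0.84 because $X = 1$ is evidence for $Z = 1$.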
Causal properties of sufficient sets:
Hence
Examples
Social influence model in Shalizi and Thomas:
 are individuals,
 denote observed traits,
 denote latent traits
 denote observed outcomes
 is a network tie
d-separates from . Since is latent and unobserved, is a confounding path from to . Likewise is a confounding path from to . Thus, are d-connected when conditioning on all the observed (boxed) variables […]. Hence the direct effect of on is not identifiable
Handy links
Bonus bits
Equivalently we could do this:
Recommended reading
People recommend Koller and Friedman (KoFr09), which includes many different flavours of DAG model and many different methods, but it didn't suit me, being somehow too detailed and too nonspecific at the same time.
Spirtes et al (SpGS01) and Pearl (Pear09b) are readable; see also Pearl's edited highlights (Pear09a). Lauritzen (Laur96) is clear, but the details of the constructions are long and detailed, and more general than here (partially directed graphs).
Lauritzen's shorter introduction (Laur00) is nice if you can get it; not overwhelming, and it starts with a slightly more general formalism (DAGs as a special case of PDAGs, moral graphs everywhere). Murphy's textbook (Murp12) has a minimal introduction intermingled with some related models, with a more ML, “expert systems”-flavoured and more Bayesian formalism.
Refs
 KiPe83: Jin H. Kim, Judea Pearl (1983) A computational model for causal and diagnostic reasoning in inference systems. In IJCAI (Vol. 83, pp. 190–193). Citeseer
 MaCo13: Marloes H. Maathuis, Diego Colombo (2013) A generalized back-door criterion. ArXiv Preprint ArXiv:1307.5636.
 MiMo07: Lilyana Mihalkova, Raymond J. Mooney (2007) Bottom-up learning of Markov logic network structure. In Proceedings of the 24th international conference on Machine learning (pp. 625–632). ACM
 BaPe16: Elias Bareinboim, Judea Pearl (2016) Causal inference and the data-fusion problem. Proceedings of the National Academy of Sciences, 113(27), 7345–7352. DOI
 Laur00: Steffen L. Lauritzen (2000) Causal inference from graphical models. In Complex stochastic systems (pp. 63–107). CRC Press
 Pear09a: Judea Pearl (2009a) Causal inference in statistics: An overview. Statistics Surveys, 3, 96–146. DOI
 PeBM15: Jonas Peters, Peter Bühlmann, Nicolai Meinshausen (2015) Causal inference using invariant prediction: identification and confidence intervals. ArXiv:1501.01332 [Stat].
 ShTc14: Ilya Shpitser, Eric Tchetgen Tchetgen (2014) Causal Inference with a Graphical Hierarchy of Interventions. ArXiv:1411.2127 [Stat].
 Bühl13: Peter Bühlmann (2013) Causal statistical inference in high dimensions. Mathematical Methods of Operations Research, 77(3), 357–370. DOI
 Gelm10: Andrew Gelman (2010) Causality and statistical learning. American Journal of Sociology, 117(3), 955–966. DOI
 Pear09b: Judea Pearl (2009b) Causality: Models, Reasoning and Inference. Cambridge University Press
 SpGS01: Peter Spirtes, Clark Glymour, Richard Scheines (2001) Causation, Prediction, and Search. The MIT Press
 Mess12: Franz H. Messerli (2012) Chocolate Consumption, Cognitive Function, and Nobel Laureates. New England Journal of Medicine, 367(16), 1562–1564. DOI
 ShPe08: Ilya Shpitser, Judea Pearl (2008) Complete identification methods for the causal hierarchy. The Journal of Machine Learning Research, 9, 1941–1979.
 ArCS99: Barry C. Arnold, Enrique Castillo, Jose M. Sarabia (1999) Conditional specification of statistical models. Springer Science & Business Media
 BüRK13: Peter Bühlmann, Philipp Rütimann, Markus Kalisch (2013) Controlling false positive selections in high-dimensional regression and causal inference. Statistical Methods in Medical Research, 22(5), 466–492.
 ShMc16: Cosma Rohilla Shalizi, Edward McFowland III (2016) Controlling for Latent Homophily in Social Networks through Inferring Latent Locations. ArXiv:1607.06565 [Physics, Stat].
 DeWR11: Xavier De Luna, Ingeborg Waernbaum, Thomas S. Richardson (2011) Covariate selection for the nonparametric estimation of an average treatment effect. Biometrika, asr041. DOI
 SmEi08: David A. Smith, Jason Eisner (2008) Dependency parsing by belief propagation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (pp. 145–156). Association for Computational Linguistics
 Ragi11: M. Raginsky (2011) Directed information and Pearl’s causal calculus. In 2011 49th Annual Allerton Conference on Communication, Control, and Computing (Allerton) (pp. 958–965). DOI
 ArMS09: Sinan Aral, Lev Muchnik, Arun Sundararajan (2009) Distinguishing influence-based contagion from homophily-driven diffusion in dynamic networks. Proceedings of the National Academy of Sciences, 106(51), 21544–21549. DOI
 KaBü07: Markus Kalisch, Peter Bühlmann (2007) Estimating High-Dimensional Directed Acyclic Graphs with the PC-Algorithm. Journal of Machine Learning Research, 8, 613–636.
 MaKB09: Marloes H. Maathuis, Markus Kalisch, Peter Bühlmann (2009) Estimating high-dimensional intervention effects from observational data. The Annals of Statistics, 37(6A), 3133–3164. DOI
 RuWa06: Donald B Rubin, Richard P Waterman (2006) Estimating the Causal Effects of Marketing Interventions Using Propensity Score Methodology. Statistical Science, 21(2), 206–222. DOI
 Pear86: Judea Pearl (1986) Fusion, propagation, and structuring in belief networks. Artificial Intelligence, 29(3), 241–288. DOI
 Elwe13: Felix Elwert (2013) Graphical causal models. In Handbook of causal analysis for social research (pp. 245–273). Springer
 Laur96: Steffen L. Lauritzen (1996) Graphical Models. Clarendon Press
 JoWe02a: Michael I. Jordan, Yair Weiss (2002a) Graphical models: Probabilistic inference. The Handbook of Brain Theory and Neural Networks, 490–496.
 BüKM14: Peter Bühlmann, Markus Kalisch, Lukas Meier (2014) High-Dimensional Statistics with a View Toward Applications in Biology. Annual Review of Statistics and Its Application, 1(1), 255–278. DOI
 ShTh11: Cosma Rohilla Shalizi, Andrew C. Thomas (2011) Homophily and Contagion Are Generically Confounded in Observational Social Network Studies. Sociological Methods & Research, 40(2), 211–239. DOI
 BGKR15: Kay H. Brodersen, Fabian Gallusser, Jim Koehler, Nicolas Remy, Steven L. Scott (2015) Inferring causal impact using Bayesian structural timeseries models. The Annals of Applied Statistics, 9(1), 247–274. DOI
 ZPJS12: Kun Zhang, Jonas Peters, Dominik Janzing, Bernhard Schölkopf (2012) Kernel-based Conditional Independence Test and Application in Causal Discovery. ArXiv:1202.3775 [Cs, Stat].
 BLZS15: Adam Bloniarz, Hanzhong Liu, Cun-Hui Zhang, Jasjeet Sekhon, Bin Yu (2015) Lasso adjustments of treatment effect estimates in randomized experiments. ArXiv:1507.03652 [Math, Stat].
 NeOt04: Richard E. Neapolitan, others (2004) Learning bayesian networks (Vol. 38). Prentice Hall Upper Saddle River
 HiOB05: Geoffrey E. Hinton, Simon Osindero, Kejie Bao (2005) Learning causally linked markov random fields. In Proceedings of the 10th International Workshop on Artificial Intelligence and Statistics (pp. 128–135). Citeseer
 CMKR12: Diego Colombo, Marloes H. Maathuis, Markus Kalisch, Thomas S. Richardson (2012) Learning high-dimensional directed acyclic graphs with latent and selection variables. The Annals of Statistics, 40(1), 294–321.
 Jord99: Michael Irwin Jordan (1999) Learning in graphical models. Cambridge, Mass.: MIT Press
 Mont11: Andrea Montanari (2011) Lecture Notes for Stat 375 Inference in Graphical Models
 LaSp88: S. L. Lauritzen, D. J. Spiegelhalter (1988) Local Computations with Probabilities on Graphical Structures and Their Application to Expert Systems. Journal of the Royal Statistical Society. Series B (Methodological), 50(2), 157–224.
 Murp12: Kevin P. Murphy (2012) Machine Learning: A Probabilistic Perspective. Cambridge, MA: The MIT Press
 ErBü14: Jan Ernest, Peter Bühlmann (2014) Marginal integration for fully robust causal inference. ArXiv:1405.1868 [Stat].
 VaBC12: Stijn Vansteelandt, Maarten Bekaert, Gerda Claeskens (2012) On model selection and model misspecification in causal inference. Statistical Methods in Medical Research, 21(1), 7–30. DOI
 BPEM14: Peter Bühlmann, Jonas Peters, Jan Ernest, Marloes Maathuis (2014) Predicting causal effects in highdimensional settings
 MCKB10: Marloes H. Maathuis, Diego Colombo, Markus Kalisch, Peter Bühlmann (2010) Predicting causal effects in large-scale systems from observational data. Nature Methods, 7(4), 247–248. DOI
 KoFr09: Daphne Koller, Nir Friedman (2009) Probabilistic graphical models : principles and techniques. Cambridge, MA: MIT Press
 JoWe02b: Michael I. Jordan, Yair Weiss (2002b) Probabilistic inference in graphical models. Handbook of Neural Networks and Brain Theory.
 Pear08: Judea Pearl (2008) Probabilistic reasoning in intelligent systems: networks of plausible inference. San Francisco, Calif: Kaufmann
 ClMH14: Tom Claassen, Joris M. Mooij, Tom Heskes (2014) Proof Supplement – Learning Sparse Causal Models is not NP-hard (UAI2013). ArXiv:1411.1557 [Stat].
 BaTP14: Elias Bareinboim, Jin Tian, Judea Pearl (2014) Recovering from Selection Bias in Causal and Statistical Inference. In AAAI (pp. 2410–2416).
 ChPe12: B Chen, J Pearl (2012) Regression and causation: A critical examination of econometric textbooks
 MPSM10: Daniel Marbach, Robert J. Prill, Thomas Schaffter, Claudio Mattiussi, Dario Floreano, Gustavo Stolovitzky (2010) Revealing strengths and weaknesses of methods for gene network inference. Proceedings of the National Academy of Sciences, 107(14), 6286–6291. DOI
 Pear82: Judea Pearl (1982) Reverend Bayes on inference engines: a distributed hierarchical approach. In Proceedings of the National Conference on Artificial Intelligence (pp. 133–136).
 Kenn15: Edward H. Kennedy (2015) Semiparametric theory and empirical processes in causal inference. ArXiv Preprint ArXiv:1510.04740.
 Wrig34: Sewall Wright (1934) The Method of Path Coefficients. The Annals of Mathematical Statistics, 5(3), 161–215. DOI
 NoNy11: Hans Noel, Brendan Nyhan (2011) The “unfriending” problem: The consequences of homophily in friendship retention for causal estimates of social influence. Social Networks, 33(3), 211–218. DOI
 YeFW03: J.S. Yedidia, W.T. Freeman, Y. Weiss (2003) Understanding Belief Propagation and Its Generalizations. In Exploring Artificial Intelligence in the New Millennium (pp. 239–269). Morgan Kaufmann Publishers
 SaVa13: Brian Sauer, Tyler J. VanderWeele (2013) Use of Directed Acyclic Graphs. Agency for Healthcare Research and Quality (US)