The Living Thing / Notebooks :

Directed graphical models

a.k.a. Bayesian belief networks, directed graphical models, feedforward neural networks, or hierarchical structural models.

Directed graphs of conditional independence relations are a convenient formalism for many models.

What’s special here is how we handle independence relations and reasoning about them. In one sense there is nothing special about graphical models; such a model is just a graph of which variables are conditionally independent of which others. On the other hand, that graph is a powerful analytic tool, telling you what is confounded with what, and when. Moreover, you can use conditional independence tests to construct that graph without necessarily constructing the whole model (e.g. ZPJS12).
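For concreteness, here is a minimal sketch (plain Python, with made-up conditional probability tables) of the kind of independence relation such a graph encodes: in the chain A → B → C, B d-separates A from C, so p(c | a, b) cannot depend on a, whatever numbers we put in the tables.

```python
from itertools import product

# Toy DAG: A -> B -> C, with arbitrary hand-picked conditional tables.
p_a = {0: 0.6, 1: 0.4}
p_b_given_a = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}  # p_b_given_a[a][b]
p_c_given_b = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.5, 1: 0.5}}  # p_c_given_b[b][c]

def joint(a, b, c):
    """Joint probability via the DAG factorisation p(a) p(b|a) p(c|b)."""
    return p_a[a] * p_b_given_a[a][b] * p_c_given_b[b][c]

def cond(c, a, b):
    """p(C = c | A = a, B = b), by marginalising the joint."""
    return joint(a, b, c) / sum(joint(a, b, cc) for cc in (0, 1))

# d-separation of A and C by B predicts p(c | a, b) does not depend on a.
for b, c in product((0, 1), repeat=2):
    assert abs(cond(c, 0, b) - cond(c, 1, b)) < 1e-12
```

The same bookkeeping scales up badly, of course; the point of the graph machinery is to read such facts off the graph without enumerating the joint.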

Once you have the graph, you can infer more detailed relations than mere conditional dependence or independence; this is precisely what hierarchical models emphasise.

These can even be causal graphical models, and when we can infer those we are extracting Science (ONO) from observational data. This is really interesting; see causal graphical models.

To do: the distinction between DAG models and undirected Markov graphs.
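One concrete bridge between the two, sketched in code: the undirected Markov graph corresponding to a DAG is its moral graph, obtained by “marrying” the co-parents of each node and then dropping edge directions. A toy pure-Python version (the dict-of-parents and frozenset edge representations are my own choices, not any particular library’s):

```python
from itertools import combinations

def moralise(parents):
    """Moral graph of a DAG given as {node: set_of_parents}.
    Marry all co-parents of each node, then drop edge directions.
    Returns an undirected edge set of 2-element frozensets."""
    edges = set()
    for child, pa in parents.items():
        for p in pa:                              # directed edge -> undirected
            edges.add(frozenset((p, child)))
        for u, v in combinations(sorted(pa), 2):  # marry co-parents
            edges.add(frozenset((u, v)))
    return edges

# Collider A -> C <- B: moralisation adds the A-B edge, which is why
# the undirected graph hides the marginal independence of A and B.
dag = {"A": set(), "B": set(), "C": {"A", "B"}}
assert moralise(dag) == {frozenset(("A", "C")), frozenset(("B", "C")),
                         frozenset(("A", "B"))}
```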

Introductory reading

People recommend Koller and Friedman to me; it covers many different flavours of DAG model and many different methods (KoFr09), but I personally didn’t like it. It drowned me in detail without motivation, and left me feeling drained yet uninformed. YMMV.

Spirtes et al (SpGS01) and Pearl (Pear08) are readable. I’ve had Lauritzen (Laur96) recommended too, but haven’t looked at it yet. Murphy’s textbook (Murp12) has a minimal introduction intermixed with some related models, with a more ML, more Bayesian formalism.

Graph inference from data

This is much more work.

Learning these models turns out to require a conditional independence test, an awareness of multiple testing, and some graph algorithms.
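A toy sketch of that recipe, assuming jointly Gaussian data and using a Fisher-z test of partial correlation as the conditional independence test (the helper names here are mine, not from any particular package):

```python
import math
import numpy as np

def partial_corr(x, y, Z):
    """Correlation of x and y after regressing both on the columns of Z."""
    if Z.shape[1] > 0:
        A = np.column_stack([Z, np.ones(len(x))])
        x = x - A @ np.linalg.lstsq(A, x, rcond=None)[0]
        y = y - A @ np.linalg.lstsq(A, y, rcond=None)[0]
    return float(np.corrcoef(x, y)[0, 1])

def fisher_z_pvalue(r, n, k):
    """Two-sided p-value for zero partial correlation from n samples,
    conditioning on k variables (Gaussian assumption)."""
    z = 0.5 * math.log((1 + r) / (1 - r))
    stat = math.sqrt(n - k - 3) * abs(z)
    return math.erfc(stat / math.sqrt(2))  # = 2 * (1 - Phi(stat))

rng = np.random.default_rng(0)
n = 2000
a = rng.normal(size=n)
b = a + rng.normal(size=n)  # chain: a -> b -> c
c = b + rng.normal(size=n)

# Marginally a and c are dependent; conditioning on b screens a off from c.
p_marg = fisher_z_pvalue(partial_corr(a, c, np.empty((n, 0))), n, 0)
p_cond = fisher_z_pvalue(partial_corr(a, c, b.reshape(-1, 1)), n, 1)
# With any sensible (e.g. Bonferroni-corrected) threshold, the first test
# rejects and the second does not -- the decision rule by which
# skeleton-learning algorithms such as PC delete edges.
```

Run over every pair and every conditioning set, this is exactly where the multiple-testing problem bites: the number of tests explodes, and each wrong answer propagates into the graph.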


Oooh! Look! Software!


Refs

Aragam, B., Amini, A. A., & Zhou, Q. (2015) Learning Directed Acyclic Graphs with Penalized Neighbourhood Regression. arXiv:1511.08963 [Cs, Math, Stat].
Aragam, B., Gu, J., & Zhou, Q. (2017) Learning Large-Scale Bayesian Networks with the sparsebn Package. arXiv:1703.04025 [Cs, Stat].
Aragam, B., & Zhou, Q. (2015) Concave Penalized Estimation of Sparse Gaussian Bayesian Networks. Journal of Machine Learning Research, 16, 2273–2328.
Aral, S., Muchnik, L., & Sundararajan, A. (2009) Distinguishing influence-based contagion from homophily-driven diffusion in dynamic networks. Proceedings of the National Academy of Sciences, 106(51), 21544–21549. DOI.
Arnold, B. C., Castillo, E., & Sarabia, J. M. (1999) Conditional specification of statistical models. Springer Science & Business Media.
Bareinboim, E., Tian, J., & Pearl, J. (2014) Recovering from Selection Bias in Causal and Statistical Inference. In AAAI (pp. 2410–2416).
Beal, M. J. (2003) Variational algorithms for approximate Bayesian inference. University of London.
Bloniarz, A., Liu, H., Zhang, C.-H., Sekhon, J., & Yu, B. (2015) Lasso adjustments of treatment effect estimates in randomized experiments. arXiv:1507.03652 [Math, Stat].
Bühlmann, P., Kalisch, M., & Meier, L. (2014) High-Dimensional Statistics with a View Toward Applications in Biology. Annual Review of Statistics and Its Application, 1(1), 255–278. DOI.
Bühlmann, P., Rütimann, P., & Kalisch, M. (2013) Controlling false positive selections in high-dimensional regression and causal inference. Statistical Methods in Medical Research, 22(5), 466–492.
Chen, B., & Pearl, J. (2012) Regression and causation: A critical examination of econometric textbooks.
Christakis, N. A., & Fowler, J. H. (2007) The Spread of Obesity in a Large Social Network over 32 Years. New England Journal of Medicine, 357(4), 370–379. DOI.
Colombo, D., Maathuis, M. H., Kalisch, M., & Richardson, T. S. (2012) Learning high-dimensional directed acyclic graphs with latent and selection variables. The Annals of Statistics, 40(1), 294–321.
De Luna, X., Waernbaum, I., & Richardson, T. S. (2011) Covariate selection for the nonparametric estimation of an average treatment effect. Biometrika, asr041. DOI.
Edwards, D., & Ankinakatte, S. (2015) Context-specific graphical models for discrete longitudinal data. Statistical Modelling, 15(4), 301–325. DOI.
Fixx, J. F. (1977) Games for the superintelligent. London: Muller.
Frey, B. J., & Jojic, N. (2005) A comparison of algorithms for inference and learning in probabilistic graphical models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(9), 1392–1416. DOI.
Gu, J., Fu, F., & Zhou, Q. (2014) Adaptive Penalized Estimation of Directed Acyclic Graphs From Categorical Data. arXiv:1403.2310 [Stat].
Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., & Saul, L. K. (1999) An Introduction to Variational Methods for Graphical Models. Machine Learning, 37(2), 183–233. DOI.
Jordan, M. I., & Weiss, Y. (2002a) Graphical models: Probabilistic inference. The Handbook of Brain Theory and Neural Networks, 490–496.
Jordan, M. I., & Weiss, Y. (2002b) Probabilistic inference in graphical models. Handbook of Neural Networks and Brain Theory.
Jordan, M. I. (1999) Learning in graphical models. Cambridge, Mass.: MIT Press.
Kalisch, M., & Bühlmann, P. (2007) Estimating High-Dimensional Directed Acyclic Graphs with the PC-Algorithm. Journal of Machine Learning Research, 8, 613–636.
Koller, D., & Friedman, N. (2009) Probabilistic graphical models: principles and techniques. Cambridge, MA: MIT Press.
Krause, A., & Guestrin, C. (2009) Optimal value of information in graphical models. J. Artif. Int. Res., 35(1), 557–591.
Lauritzen, S. L., & Spiegelhalter, D. J. (1988) Local Computations with Probabilities on Graphical Structures and Their Application to Expert Systems. Journal of the Royal Statistical Society. Series B (Methodological), 50(2), 157–224.
Lauritzen, S. L. (1996) Graphical Models. Clarendon Press.
Maathuis, M. H., & Colombo, D. (2013) A generalized backdoor criterion. arXiv Preprint arXiv:1307.5636.
Malioutov, D. M., Johnson, J. K., & Willsky, A. S. (2006) Walk-Sums and Belief Propagation in Gaussian Graphical Models. Journal of Machine Learning Research, 7, 2031–2064.
Marbach, D., Prill, R. J., Schaffter, T., Mattiussi, C., Floreano, D., & Stolovitzky, G. (2010) Revealing strengths and weaknesses of methods for gene network inference. Proceedings of the National Academy of Sciences, 107(14), 6286–6291. DOI.
Mihalkova, L., & Mooney, R. J. (2007) Bottom-up learning of Markov logic network structure. In Proceedings of the 24th International Conference on Machine Learning (pp. 625–632). ACM.
Montanari, A. (2011) Lecture Notes for Stat 375 Inference in Graphical Models.
Murphy, K. P. (2012) Machine Learning: A Probabilistic Perspective. (1st ed.). Cambridge, MA: The MIT Press.
Neapolitan, R. E., & others. (2004) Learning Bayesian networks. (Vol. 38). Upper Saddle River: Prentice Hall.
Pearl, J. (1982) Reverend Bayes on inference engines: a distributed hierarchical approach. In Proceedings of the National Conference on Artificial Intelligence (pp. 133–136).
Pearl, J. (1986) Fusion, propagation, and structuring in belief networks. Artificial Intelligence, 29(3), 241–288. DOI.
Pearl, J. (2008) Probabilistic reasoning in intelligent systems: networks of plausible inference. (Rev. 2nd printing). San Francisco, Calif.: Kaufmann.
Pereda, E., Quiroga, R. Q., & Bhattacharya, J. (2005) Nonlinear multivariate analysis of neurophysiological signals. Progress in Neurobiology, 77(1–2), 1–37.
Pollard, D. (2004) Hammersley-Clifford theorem for Markov random fields.
Rabbat, M. G., Figueiredo, M. A. T., & Nowak, R. D. (2008) Network Inference from Co-Occurrences. IEEE Transactions on Information Theory, 54(9), 4053–4068. DOI.
Schmidt, M. (2010) Graphical model structure learning with l1-regularization. University of British Columbia.
Shalizi, C. R., & McFowland III, E. (2016) Controlling for Latent Homophily in Social Networks through Inferring Latent Locations. arXiv:1607.06565 [Physics, Stat].
Smith, D. A., & Eisner, J. (2008) Dependency parsing by belief propagation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (pp. 145–156). Association for Computational Linguistics
Spirtes, P., Glymour, C., & Scheines, R. (2001) Causation, Prediction, and Search. (Second Edition.). The MIT Press
Studený, M., & Vejnarová, J. (1998) On multiinformation function as a tool for measuring stochastic dependence. In Learning in graphical models (pp. 261–297). Cambridge, Mass.: MIT Press
Su, R.-Q., Wang, W.-X., & Lai, Y.-C. (2012) Detecting hidden nodes in complex networks from time series. Phys. Rev. E, 85(6), 065201. DOI.
Textor, J., Idelberger, A., & Liśkiewicz, M. (2015) Learning from Pairwise Marginal Independencies. arXiv:1508.00280 [Cs].
Visweswaran, S., & Cooper, G. F. (2014) Counting Markov Blanket Structures. arXiv:1407.2483 [Cs, Stat].
Wainwright, M. J., & Jordan, M. I. (2008) Graphical models, exponential families, and variational inference. Foundations and Trends® in Machine Learning, 1(1–2), 1–305. DOI.
Weiss, Y. (2000) Correctness of Local Probability Propagation in Graphical Models with Loops. Neural Computation, 12(1), 1–41. DOI.
Weiss, Y., & Freeman, W. T. (2001) Correctness of Belief Propagation in Gaussian Graphical Models of Arbitrary Topology. Neural Computation, 13(10), 2173–2200. DOI.
Winn, J. M., & Bishop, C. M. (2005) Variational message passing. Journal of Machine Learning Research, 6, 661–694.
Wright, S. (1934) The Method of Path Coefficients. The Annals of Mathematical Statistics, 5(3), 161–215. DOI.
Yedidia, J. S., Freeman, W. T., & Weiss, Y. (2003) Understanding Belief Propagation and Its Generalizations. In G. Lakemeyer & B. Nebel (Eds.), Exploring Artificial Intelligence in the New Millennium (pp. 239–269). Morgan Kaufmann Publishers.
Zhang, K., Peters, J., Janzing, D., & Schölkopf, B. (2012) Kernel-based Conditional Independence Test and Application in Causal Discovery. arXiv:1202.3775 [Cs, Stat].