a.k.a. Bayesian belief networks, directed graphical models, feedforward neural networks, or hierarchical structural models.
Directed graphs of conditional independence relations are a convenient formalism for many models.
What’s special here is how we handle independence relations and reasoning about them. In one sense there is nothing special about graphical models; it’s just a graph of which variables are conditionally independent of which others. On the other hand, that graph is a powerful analytic tool, telling you what is confounded with what, and when. Moreover, you can use conditional independence tests to construct that graph even without necessarily constructing the whole model (e.g. ZPJS12).
Once you have the graph, you can infer more detailed relations than mere conditional dependence or independence; this is precisely what hierarchical models emphasise.
These can even be causal graphical models, and when we can infer those we are extracting Science (ONO) from observational data. This is really interesting; see causal graphical models.
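To make the conditional-independence-test idea concrete, here is a minimal sketch of a Fisher-z test for jointly Gaussian data. The function name is mine and it handles only a single conditioning variable; serious constraint-based learners use higher-order conditioning sets and more robust tests.

```python
import math

def fisher_z_ci_test(x, y, z):
    """Test X independent of Y given Z via partial correlation,
    assuming joint Gaussianity. x, y, z are equal-length lists of
    floats; z is a single conditioning variable (the simplest case).
    Returns True when the null of conditional independence is NOT
    rejected at (roughly) the 5% level. A sketch, not a robust test.
    """
    n = len(x)
    def corr(a, b):
        ma, mb = sum(a) / n, sum(b) / n
        cov = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
        va = sum((ai - ma) ** 2 for ai in a)
        vb = sum((bi - mb) ** 2 for bi in b)
        return cov / math.sqrt(va * vb)
    r_xy, r_xz, r_yz = corr(x, y), corr(x, z), corr(y, z)
    # First-order partial correlation of X and Y given Z.
    r = (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))
    # Fisher z-transform: under H0 the statistic is approximately N(0, 1).
    stat = math.sqrt(n - 4) * 0.5 * math.log((1 + r) / (1 - r))
    return abs(stat) < 1.96  # 1.96: two-sided 5% critical value

# Deterministic toy data: x and w share the common cause y but are
# otherwise unrelated (d1, d2, e are mutually orthogonal, zero-mean).
N = 250
y  = [1, -1,  1, -1] * (2 * N)
d1 = [1,  1, -1, -1] * (2 * N)
d2 = [1, -1, -1,  1] * (2 * N)
e  = ([1] * 4 + [-1] * 4) * N
x = [yi + di for yi, di in zip(y, d1)]
w = [yi + di for yi, di in zip(y, d2)]
print(fisher_z_ci_test(x, w, y))  # True: independent given the common cause
print(fisher_z_ci_test(x, w, e))  # False: e is irrelevant; x, w stay correlated
```

Note that the second call still sees the marginal correlation between x and w: conditioning on an irrelevant variable does not screen off the common cause, which is the d-separation story in miniature.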
TBD: the distinction between DAG (directed) and Markov (undirected) graphs.
Introductory reading
People recommend Koller and Friedman (KoFr09) to me, which covers many different flavours of DAG model and many different methods, but I personally didn’t like it: it drowned me in details without motivation, and left me feeling drained yet uninformed. YMMV.
Spirtes et al (SpGS01) and Pearl (Pear08) are readable. I’ve had Lauritzen (Laur96) recommended too, but haven’t looked at it yet. Murphy’s textbook (Murp12) has a minimal introduction intermixed with some related models, in a more ML-flavoured, more Bayesian formalism.
Graph inference from data
This takes much more work. Learning these models turns out to require conditional independence tests, an awareness of multiple testing, and graph-search machinery.
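Here is a toy illustration of those ingredients working together: a constraint-based skeleton search with a Bonferroni correction across all the tests it performs. Everything here (the names, the order-at-most-1 restriction) is my own sketch, not any particular package's algorithm; the real PC algorithm (KaBü07) grows conditioning sets adaptively.

```python
import math
from itertools import combinations
from statistics import NormalDist

def skeleton(data, alpha=0.05):
    """Toy PC-style skeleton search using Fisher-z tests of order 0 and 1.

    data maps variable name -> list of samples (assumed jointly Gaussian).
    Start from the complete undirected graph; delete edge (i, j) whenever
    X_i and X_j test independent marginally or given one other variable.
    alpha is Bonferroni-corrected for the number of tests performed, a
    crude acknowledgement of the multiple-testing problem.
    """
    names = list(data)
    n = len(next(iter(data.values())))

    def corr(a, b):
        ma, mb = sum(a) / n, sum(b) / n
        cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
        return cov / math.sqrt(sum((u - ma) ** 2 for u in a)
                               * sum((v - mb) ** 2 for v in b))

    def indep(i, j, k, crit):
        r = corr(data[i], data[j])
        df = n - 3
        if k is not None:  # first-order partial correlation
            r_ik, r_jk = corr(data[i], data[k]), corr(data[j], data[k])
            r = (r - r_ik * r_jk) / math.sqrt((1 - r_ik ** 2) * (1 - r_jk ** 2))
            df = n - 4
        return math.sqrt(df) * 0.5 * abs(math.log((1 + r) / (1 - r))) < crit

    pairs = list(combinations(names, 2))
    n_tests = len(pairs) * (len(names) - 1)  # one order-0 + (p-2) order-1 per pair
    crit = NormalDist().inv_cdf(1 - alpha / n_tests / 2)  # Bonferroni-adjusted
    edges = set(pairs)
    for i, j in pairs:
        for k in [None] + [v for v in names if v not in (i, j)]:
            if indep(i, j, k, crit):
                edges.discard((i, j))  # found a separating set: drop the edge
                break
    return edges

# Deterministic toy data: x and w are related only through y.
N = 250
yv = [1, -1,  1, -1] * (2 * N)
d1 = [1,  1, -1, -1] * (2 * N)
d2 = [1, -1, -1,  1] * (2 * N)
data = {"x": [a + b for a, b in zip(yv, d1)],
        "y": yv,
        "w": [a + b for a, b in zip(yv, d2)]}
print(sorted(skeleton(data)))  # [('x', 'y'), ('y', 'w')]: the x-w edge is gone
```

Without the Bonferroni adjustment, each of the six tests would run at the nominal 5% level, and with many variables you would spuriously delete or retain edges; the correction is the crudest possible fix, but it makes the multiple-testing issue visible.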
Implementation
Oooh! look! Software!
bnlearn learns belief networks
gmtk learns “dynamic graphical models”, which I believe means the special case of time-series chain graphs.
A lot of other R packages; recommendations TBD.

 sparsebn (ArGZ17): a new R package for learning sparse Bayesian networks and other graphical models from high-dimensional data via sparse regularization. Designed from the ground up to handle:
 Experimental data with interventions
 Mixed observational / experimental data
 High-dimensional data with p >> n
 Datasets with thousands of variables (tested up to p=8000)
 Continuous and discrete data
The emphasis of this package is scalability and statistical consistency on high-dimensional datasets. […] For more details on this package, including worked examples and the methodological background, please see our new preprint [1].
Overview
The main methods for learning graphical models are:
 estimate.dag for directed acyclic graphs (Bayesian networks).
 estimate.precision for undirected graphs (Markov random fields).
 estimate.covariance for covariance matrices.
Currently, estimation of precision and covariance matrices is limited to Gaussian data.
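As an aside on why precision matrices appear here: for Gaussian variables, zeros in the precision (inverse covariance) matrix correspond exactly to missing edges in the undirected Markov graph. A tiny hand-worked example, with a covariance matrix I chose so that x and w are related only through y:

```python
def invert3(m):
    """Invert a 3x3 matrix via the adjugate formula (fine for a toy example)."""
    det = (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
         - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
         + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))
    return [
        [(m[1][1] * m[2][2] - m[1][2] * m[2][1]) / det,
         (m[0][2] * m[2][1] - m[0][1] * m[2][2]) / det,
         (m[0][1] * m[1][2] - m[0][2] * m[1][1]) / det],
        [(m[1][2] * m[2][0] - m[1][0] * m[2][2]) / det,
         (m[0][0] * m[2][2] - m[0][2] * m[2][0]) / det,
         (m[0][2] * m[1][0] - m[0][0] * m[1][2]) / det],
        [(m[1][0] * m[2][1] - m[1][1] * m[2][0]) / det,
         (m[0][1] * m[2][0] - m[0][0] * m[2][1]) / det,
         (m[0][0] * m[1][1] - m[0][1] * m[1][0]) / det],
    ]

# Covariance of (x, y, w) where x = y + noise1 and w = y + noise2:
# x and w are marginally dependent (cov 1), conditionally independent given y.
cov = [[2.0, 1.0, 1.0],
       [1.0, 1.0, 1.0],
       [1.0, 1.0, 2.0]]
prec = invert3(cov)
# The (x, w) entry of the precision matrix is exactly 0, i.e. there is
# no x-w edge in the undirected (Markov) graph; the other off-diagonal
# entries are nonzero, so the x-y and y-w edges remain.
print(prec)
```

In practice one estimates the covariance from data and the zeros are only approximate, which is why sparse penalized estimators of the precision matrix (as in the package above) are the interesting case.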
Refs
 ArAZ15
 Aragam, B., Amini, A. A., & Zhou, Q. (2015) Learning Directed Acyclic Graphs with Penalized Neighbourhood Regression. arXiv:1511.08963 [Cs, Math, Stat].
 ArGZ17
 Aragam, B., Gu, J., & Zhou, Q. (2017) Learning Large-Scale Bayesian Networks with the sparsebn Package. arXiv:1703.04025 [Cs, Stat].
 ArZh15
 Aragam, B., & Zhou, Q. (2015) Concave Penalized Estimation of Sparse Gaussian Bayesian Networks. Journal of Machine Learning Research, 16, 2273–2328.
 ArMS09
 Aral, S., Muchnik, L., & Sundararajan, A. (2009) Distinguishing influence-based contagion from homophily-driven diffusion in dynamic networks. Proceedings of the National Academy of Sciences, 106(51), 21544–21549. DOI.
 ArCS99
 Arnold, B. C., Castillo, E., & Sarabia, J. M. (1999) Conditional specification of statistical models. Springer Science & Business Media.
 BaTP14
 Bareinboim, E., Tian, J., & Pearl, J. (2014) Recovering from Selection Bias in Causal and Statistical Inference. In AAAI (pp. 2410–2416).
 Beal03
 Beal, M. J. (2003) Variational algorithms for approximate Bayesian inference. University of London.
 BLZS15
 Bloniarz, A., Liu, H., Zhang, C.-H., Sekhon, J., & Yu, B. (2015) Lasso adjustments of treatment effect estimates in randomized experiments. arXiv:1507.03652 [Math, Stat].
 BüKM14
 Bühlmann, P., Kalisch, M., & Meier, L. (2014) High-Dimensional Statistics with a View Toward Applications in Biology. Annual Review of Statistics and Its Application, 1(1), 255–278. DOI.
 BüRK13
 Bühlmann, P., Rütimann, P., & Kalisch, M. (2013) Controlling false positive selections in high-dimensional regression and causal inference. Statistical Methods in Medical Research, 22(5), 466–492.
 ChPe12
 Chen, B., & Pearl, J. (2012) Regression and causation: A critical examination of econometric textbooks.
 ChFo07
 Christakis, N. A., & Fowler, J. H. (2007) The Spread of Obesity in a Large Social Network over 32 Years. New England Journal of Medicine, 357(4), 370–379. DOI.
 CMKR12
 Colombo, D., Maathuis, M. H., Kalisch, M., & Richardson, T. S. (2012) Learning high-dimensional directed acyclic graphs with latent and selection variables. The Annals of Statistics, 40(1), 294–321.
 DeWR11
 De Luna, X., Waernbaum, I., & Richardson, T. S. (2011) Covariate selection for the nonparametric estimation of an average treatment effect. Biometrika, asr041. DOI.
 EdAn15
 Edwards, D., & Ankinakatte, S. (2015) Context-specific graphical models for discrete longitudinal data. Statistical Modelling, 15(4), 301–325. DOI.
 Fixx77
 Fixx, J. F. (1977) Games for the superintelligent. London: Muller.
 FrJo05
 Frey, B. J., & Jojic, N. (2005) A comparison of algorithms for inference and learning in probabilistic graphical models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(9), 1392–1416. DOI.
 GuFZ14
 Gu, J., Fu, F., & Zhou, Q. (2014) Adaptive Penalized Estimation of Directed Acyclic Graphs From Categorical Data. arXiv:1403.2310 [Stat].
 JGJS99
 Jordan, Michael I., Ghahramani, Z., Jaakkola, T. S., & Saul, L. K. (1999) An Introduction to Variational Methods for Graphical Models. Machine Learning, 37(2), 183–233. DOI.
 JoWe02a
 Jordan, Michael I., & Weiss, Y. (2002a) Graphical models: Probabilistic inference. The Handbook of Brain Theory and Neural Networks, 490–496.
 JoWe02b
 Jordan, Michael I., & Weiss, Y. (2002b) Probabilistic inference in graphical models. Handbook of Neural Networks and Brain Theory.
 Jord99
 Jordan, Michael Irwin. (1999) Learning in graphical models. Cambridge, Mass.: MIT Press.
 KaBü07
 Kalisch, M., & Bühlmann, P. (2007) Estimating High-Dimensional Directed Acyclic Graphs with the PC-Algorithm. Journal of Machine Learning Research, 8, 613–636.
 KoFr09
 Koller, D., & Friedman, N. (2009) Probabilistic graphical models: principles and techniques. Cambridge, MA: MIT Press.
 KrGu09
 Krause, A., & Guestrin, C. (2009) Optimal value of information in graphical models. J. Artif. Int. Res., 35(1), 557–591.
 LaSp88
 Lauritzen, S. L., & Spiegelhalter, D. J. (1988) Local Computations with Probabilities on Graphical Structures and Their Application to Expert Systems. Journal of the Royal Statistical Society. Series B (Methodological), 50(2), 157–224.
 Laur96
 Lauritzen, Steffen L. (1996) Graphical Models. Clarendon Press.
 MaCo13
 Maathuis, M. H., & Colombo, D. (2013) A generalized backdoor criterion. arXiv Preprint arXiv:1307.5636.
 MaJW06
 Malioutov, D. M., Johnson, J. K., & Willsky, A. S. (2006) Walk-Sums and Belief Propagation in Gaussian Graphical Models. Journal of Machine Learning Research, 7, 2031–2064.
 MPSM10
 Marbach, D., Prill, R. J., Schaffter, T., Mattiussi, C., Floreano, D., & Stolovitzky, G. (2010) Revealing strengths and weaknesses of methods for gene network inference. Proceedings of the National Academy of Sciences, 107(14), 6286–6291. DOI.
 MiMo07
 Mihalkova, L., & Mooney, R. J. (2007) Bottom-up learning of Markov logic network structure. In Proceedings of the 24th international conference on Machine learning (pp. 625–632). ACM.
 Mont11
 Montanari, A. (2011) Lecture Notes for Stat 375 Inference in Graphical Models.
 Murp12
 Murphy, K. P. (2012) Machine Learning: A Probabilistic Perspective (1st edition). Cambridge, MA: The MIT Press.
 NeOt04
 Neapolitan, R. E., & others. (2004) Learning Bayesian networks (Vol. 38). Prentice Hall, Upper Saddle River.
 Pear82
 Pearl, J. (1982) Reverend Bayes on inference engines: a distributed hierarchical approach. In Proceedings of the National Conference on Artificial Intelligence (pp. 133–136).
 Pear86
 Pearl, J. (1986) Fusion, propagation, and structuring in belief networks. Artificial Intelligence, 29(3), 241–288. DOI.
 Pear08
 Pearl, J. (2008) Probabilistic reasoning in intelligent systems: networks of plausible inference (rev. 2nd printing). San Francisco, Calif.: Kaufmann.
 PeQB05
 Pereda, E., Quiroga, R. Q., & Bhattacharya, J. (2005) Nonlinear multivariate analysis of neurophysiological signals. Progress in Neurobiology, 77(1–2), 1–37.
 Poll04
 Pollard, D. (2004) Hammersley-Clifford theorem for Markov random fields.
 RaFN08
 Rabbat, M. G., Figueiredo, Má. A. T., & Nowak, R. D. (2008) Network Inference from Co-Occurrences. IEEE Transactions on Information Theory, 54(9), 4053–4068. DOI.
 Schm10
 Schmidt, M. (2010) Graphical model structure learning with l1-regularization. University of British Columbia.
 ShMc16
 Shalizi, C. R., & McFowland III, E. (2016) Controlling for Latent Homophily in Social Networks through Inferring Latent Locations. arXiv:1607.06565 [Physics, Stat].
 SmEi08
 Smith, D. A., & Eisner, J. (2008) Dependency parsing by belief propagation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (pp. 145–156). Association for Computational Linguistics
 SpGS01
 Spirtes, P., Glymour, C., & Scheines, R. (2001) Causation, Prediction, and Search. (Second Edition.). The MIT Press
 StVe98
 Studený, M., & Vejnarová, J. (1998) On multi-information function as a tool for measuring stochastic dependence. In Learning in graphical models (pp. 261–297). Cambridge, Mass.: MIT Press.
 SuWL12
 Su, R.-Q., Wang, W.-X., & Lai, Y.-C. (2012) Detecting hidden nodes in complex networks from time series. Phys. Rev. E, 85(6), 065201. DOI.
 TeIL15
 Textor, J., Idelberger, A., & Liśkiewicz, M. (2015) Learning from Pairwise Marginal Independencies. arXiv:1508.00280 [Cs].
 ViCo14
 Visweswaran, S., & Cooper, G. F. (2014) Counting Markov Blanket Structures. arXiv:1407.2483 [Cs, Stat].
 WaJo08
 Wainwright, M. J., & Jordan, M. I. (2008) Graphical models, exponential families, and variational inference. Foundations and Trends® in Machine Learning, 1(1–2), 1–305. DOI.
 Weis00
 Weiss, Y. (2000) Correctness of Local Probability Propagation in Graphical Models with Loops. Neural Computation, 12(1), 1–41. DOI.
 WeFr01
 Weiss, Y., & Freeman, W. T. (2001) Correctness of Belief Propagation in Gaussian Graphical Models of Arbitrary Topology. Neural Computation, 13(10), 2173–2200. DOI.
 WiBi05
 Winn, J. M., & Bishop, C. M. (2005) Variational message passing. In Journal of Machine Learning Research (pp. 661–694).
 Wrig34
 Wright, S. (1934) The Method of Path Coefficients. The Annals of Mathematical Statistics, 5(3), 161–215. DOI.
 YeFW03
 Yedidia, J. S., Freeman, W. T., & Weiss, Y. (2003) Understanding Belief Propagation and Its Generalizations. In G. Lakemeyer & B. Nebel (Eds.), Exploring Artificial Intelligence in the New Millennium (pp. 239–236). Morgan Kaufmann Publishers
 ZPJS12
 Zhang, K., Peters, J., Janzing, D., & Schölkopf, B. (2012) Kernel-based Conditional Independence Test and Application in Causal Discovery. arXiv:1202.3775 [Cs, Stat].