Statistical learning theory for dependent data

Statistical learning theory for dependent data, such as time series, and perhaps for other dependency structures.

Non-stationary, non-asymptotic bounds, please. Keywords: ergodic, α-mixing, β-mixing.
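For reference, the strong-mixing coefficients behind those keywords are usually defined roughly as follows (a sketch of the standard definitions; notation and normalization vary by author):

```latex
% For a process (X_t), write \sigma_1^k = \sigma(X_1,\dots,X_k) and
% \sigma_{k+n}^\infty = \sigma(X_{k+n}, X_{k+n+1}, \dots). Then
\alpha(n) = \sup_{k}\ \sup_{A \in \sigma_1^k,\; B \in \sigma_{k+n}^\infty}
  \left| P(A \cap B) - P(A)\,P(B) \right|,

\beta(n) = \sup_{k}\ \mathbb{E}\!\left[ \sup_{B \in \sigma_{k+n}^\infty}
  \left| P\!\left(B \mid \sigma_1^k\right) - P(B) \right| \right].
```

The process is α-mixing (resp. β-mixing) if the corresponding coefficient tends to 0 as the gap n grows; β(n) ≥ α(n), so β-mixing is the stronger condition.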

Mohri and Kuznetsov have done lots of work here; see, e.g., their NIPS 2016 tutorial, or KuMo16.

There seem to be many types of ergodic/mixing results, about which I as yet know little. Notably, KuMo15 and KuMo16 try to go beyond the mixing setup.

Overview in McSS11a:

Yu (Yu94) sets forth many of the uniform ergodic theorems that are needed to derive generalization error bounds for stochastic processes. Meir (Meir00) is one of the first papers to construct risk bounds for time series. […]

More recently, others have provided PAC results for non-IID data. Steinwart and Christmann (StCh09) prove an oracle inequality for generic regularized empirical risk minimization algorithms learning from α-mixing processes, a fairly general sort of weak serial dependence, getting learning rates for least-squares support vector machines (SVMs) close to the optimal IID rates. Mohri and Rostamizadeh (MoRo09b) prove stability-based generalization bounds when the data are stationary and φ-mixing or β-mixing, strictly generalizing IID results and applying to all stable learning algorithms. […] Karandikar and Vidyasagar (KaVi00) show that if an algorithm is “subadditive” and yields a predictor whose risk can be upper bounded when the data are IID, then the same algorithm yields predictors whose risk can be bounded if data are β-mixing. They use this result to derive generalization error bounds in terms of the learning rates for IID data and the β-mixing coefficients.
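Many of these β-mixing results (going back to Yu94) rest on a blocking device: chop the sequence into consecutive blocks, and note that alternating blocks are nearly independent when the block length is large relative to the mixing time, so IID tools apply to the block sequence. A minimal Python sketch of that device (`alternating_blocks` is my illustrative name, not a function from any of the cited papers):

```python
import numpy as np

def alternating_blocks(x, block_len):
    """Split a sequence into consecutive blocks of length block_len
    and return the odd- and even-position blocks separately.

    Blocking device from mixing-process arguments: if block_len is
    large relative to the mixing time, the odd-position blocks behave
    almost like an IID sample of blocks, so IID concentration results
    can be applied to them at the cost of mixing-coefficient terms.
    """
    n_blocks = len(x) // block_len
    blocks = [x[i * block_len:(i + 1) * block_len] for i in range(n_blocks)]
    return blocks[0::2], blocks[1::2]

# Toy example: a dependent AR(1) sequence, blocked.
rng = np.random.default_rng(0)
x = np.zeros(1000)
for t in range(1, 1000):
    x[t] = 0.8 * x[t - 1] + rng.normal()

odd, even = alternating_blocks(x, block_len=50)
# Block means over the odd blocks: the "nearly independent" subsample.
block_means = [b.mean() for b in odd]
```

The price of the approximation shows up in the bounds as an additive term involving β(block gap), which is why the rates interpolate between the IID case and no guarantee at all as mixing slows.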

See also Geer02.
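For orientation, the classical IID Hoeffding inequality that Geer02 and related work generalize to dependent variables:

```latex
% X_1,\dots,X_n independent with a_i \le X_i \le b_i almost surely:
P\!\left( \left| \frac{1}{n}\sum_{i=1}^n \left(X_i - \mathbb{E}X_i\right) \right|
    \ge \varepsilon \right)
  \le 2 \exp\!\left( - \frac{2 n^2 \varepsilon^2}{\sum_{i=1}^n (b_i - a_i)^2} \right).
```

Dependent-data versions typically weaken this, e.g. by replacing n with an effective sample size controlled by the mixing coefficients, or by working with martingale-difference decompositions.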


Alquier, P., Li, X., & Wintenberger, O. (2013) Prediction of time series by statistical learning: general losses and fast rates. Dependence Modeling, 1, 65–93. DOI.
Alquier, P., & Wintenberger, O. (2012) Model selection for weakly dependent time series forecasting. Bernoulli.
Cortes, C., Kuznetsov, V., Mohri, M., & Yang, S. (2016) Structured Prediction Theory Based on Factor Graph Complexity. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, & R. Garnett (Eds.), Advances in Neural Information Processing Systems 29 (pp. 2514–2522). Curran Associates, Inc.
Karandikar, R. L., & Vidyasagar, M. (n.d.) Probably approximately correct learning with beta-mixing input sequences.
Kontorovich, L. (Aryeh), Cortes, C., & Mohri, M. (2008) Kernel methods for learning languages. Theoretical Computer Science, 405(3), 223–236. DOI.
Kontorovich, L., Cortes, C., & Mohri, M. (2006) Learning Linearly Separable Languages. In J. L. Balcázar, P. M. Long, & F. Stephan (Eds.), Algorithmic Learning Theory (pp. 288–303). Springer Berlin Heidelberg
Kuznetsov, V., & Mohri, M. (2014) Forecasting Non-Stationary Time Series: From Theory to Algorithms.
Kuznetsov, V., & Mohri, M. (2015) Learning Theory and Algorithms for Forecasting Non-Stationary Time Series. In Advances in Neural Information Processing Systems (pp. 541–549). Curran Associates, Inc.
Kuznetsov, V., & Mohri, M. (2016) Generalization Bounds for Non-stationary Mixing Processes. Machine Learning.
McDonald, D. J., Shalizi, C. R., & Schervish, M. (2011a) Generalization error bounds for stationary autoregressive models. ArXiv:1103.0942 [Cs, Stat].
McDonald, D. J., Shalizi, C. R., & Schervish, M. (2011b) Risk bounds for time series without strong mixing. ArXiv:1106.0730 [Cs, Stat].
Meir, R. (2000) Nonparametric Time Series Prediction Through Adaptive Model Selection. Machine Learning, 39(1), 5–34. DOI.
Mohri, M., & Rostamizadeh, A. (2009a) Rademacher complexity bounds for non-iid processes. In Advances in Neural Information Processing Systems (pp. 1097–1104).
Mohri, M., & Rostamizadeh, A. (2009b) Stability Bounds for Stationary ϕ-mixing and β-mixing Processes. Journal of Machine Learning Research, 4, 1–26.
Rakhlin, A., Sridharan, K., & Tewari, A. (2014) Sequential complexities and uniform martingale laws of large numbers. Probability Theory and Related Fields, 161(1–2), 111–153. DOI.
Shalizi, C., & Kontorovich, A. (2013) Predictive PAC learning and process decompositions. In Advances in neural information processing systems (pp. 1619–1627).
Steinwart, I., & Christmann, A. (2009) Fast learning from non-iid observations. In Advances in neural information processing systems (pp. 1768–1776).
van de Geer, S. (2002) On Hoeffding's inequality for dependent random variables. In Empirical Process Techniques for Dependent Data. Birkhäuser.
Yu, B. (1994) Rates of Convergence for Empirical Processes of Stationary Mixing Sequences. The Annals of Probability, 22(1), 94–116. DOI.