# Statistical learning theory for time series

Usefulness: 🔧
Novelty: 💡
Uncertainty: 🤪 🤪 🤪
Incompleteness: 🚧 🚧 🚧

Statistical learning theory for dependent data such as time series, and possibly other dependency structures; but I only know of results for time series.

Non-stationary, non-asymptotic bounds please. Keywords: ergodic, α-mixing, β-mixing.
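For reference, the standard definitions (textbook material, not specific to any paper below): for a sequence $(X_t)$, the α- and β-mixing coefficients measure how quickly past and future become independent as the gap $k$ grows:

```latex
\alpha(k) = \sup_{n}\; \sup_{A \in \sigma(X_{-\infty}^{n}),\; B \in \sigma(X_{n+k}^{\infty})}
  \left| P(A \cap B) - P(A)\,P(B) \right|,
\qquad
\beta(k) = \sup_{n}\; \mathbb{E} \sup_{B \in \sigma(X_{n+k}^{\infty})}
  \left| P\bigl(B \mid \sigma(X_{-\infty}^{n})\bigr) - P(B) \right|.
```

The process is α- (resp. β-) mixing if the coefficient tends to zero as $k \to \infty$; since $\alpha(k) \le \beta(k)$, β-mixing is the stronger condition.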

Mohri and Kuznetsov have done lots of work here; see, e.g., their NIPS 2016 tutorial. There seem to be a lot of types of ergodic/mixing results, about which I as yet know nothing. Notably, (Kuznetsov and Mohri 2016, 2015) try to go beyond this setup.

An overview up to 2011 is given in (McDonald, Shalizi, and Schervish 2011b):

(Yu 1994) sets forth many of the uniform ergodic theorems that are needed to derive generalization error bounds for stochastic processes. (Meir 2000) is one of the first papers to construct risk bounds for time series. […]

More recently, others have provided PAC results for non-IID data. (Steinwart and Christmann 2009) prove an oracle inequality for generic regularized empirical risk minimization algorithms learning from α-mixing processes, a fairly general sort of weak serial dependence, getting learning rates for least-squares support vector machines (SVMs) close to the optimal IID rates. (Mohri and Rostamizadeh 2009b) prove stability-based generalization bounds when the data are stationary and φ-mixing or β-mixing, strictly generalizing IID results and applying to all stable learning algorithms. […] (Karandikar and Vidyasagar n.d.) show that if an algorithm is “subadditive” and yields a predictor whose risk can be upper bounded when the data are IID, then the same algorithm yields predictors whose risk can be bounded if data are β-mixing. They use this result to derive generalization error bounds in terms of the learning rates for IID data and the β-mixing coefficients.
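A device underlying many of these results is the independent-blocks technique popularized by (Yu 1994): carve the sequence into alternating blocks, keep every other block, and treat the kept blocks as if they were independent, paying an additive penalty governed by the β-mixing coefficient at the gap length. A minimal illustrative sketch (my own, not code from any cited paper):

```python
import numpy as np

def independent_blocks(x, block_len):
    """Split a sequence into consecutive blocks and keep the even-indexed ones.

    Under beta-mixing, the kept blocks behave like i.i.d. draws up to an
    additive error of roughly (number of kept blocks) * beta(block_len).
    """
    n_blocks = len(x) // block_len
    blocks = [x[i * block_len:(i + 1) * block_len] for i in range(n_blocks)]
    return blocks[0::2]  # keep every other block; the gaps decouple them

# Example: an AR(1) process, which is geometrically beta-mixing.
rng = np.random.default_rng(0)
x = np.zeros(1000)
for t in range(1, 1000):
    x[t] = 0.7 * x[t - 1] + rng.standard_normal()

kept = independent_blocks(x, block_len=20)
print(len(kept), "blocks of length", len(kept[0]))  # 25 blocks of length 20
```

The IID machinery (concentration inequalities, symmetrization) is then applied to the kept blocks, which is where the mixing coefficient enters the final bound.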

Bounds that depend on mixing coefficients can be unsatisfying: even when the mixing coefficient can be estimated from the data, the estimate depends on the parameters fit to the data, which in turn depend on the mixing coefficient.
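To make that circularity concrete: for a stationary AR(1) process the β-mixing coefficients decay geometrically at a rate governed by the autoregressive coefficient, so a plug-in estimate of the mixing rate is itself a function of the fitted parameter. A hedged sketch (the geometric-decay proxy here is a standard AR(1) heuristic, not a bound from any of the cited papers):

```python
import numpy as np

rng = np.random.default_rng(1)
rho_true, n = 0.8, 5000
x = np.zeros(n)
for t in range(1, n):
    x[t] = rho_true * x[t - 1] + rng.standard_normal()

# Fit the AR(1) coefficient by least squares on the dependent data...
rho_hat = np.dot(x[1:], x[:-1]) / np.dot(x[:-1], x[:-1])

# ...then plug it into a geometric-decay proxy for the mixing rate.
# The "mixing estimate" now depends on the fitted parameter, which in
# turn depended on the dependent data: the circularity noted above.
def beta_proxy(k):
    return abs(rho_hat) ** k

print(round(rho_hat, 2), beta_proxy(10))
```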

The only work I’ve seen which sidesteps this entirely is Kuznetsov and Mohri’s (Kuznetsov and Mohri 2015, 2014; note that the papers make more sense read in reverse order of publication date):

Time series forecasting plays a crucial role in a number of domains ranging from weather forecasting and earthquake prediction to applications in economics and finance. The classical statistical approaches to time series analysis are based on generative models such as the autoregressive moving average (ARMA) models, or their integrated versions (ARIMA) and several other extensions […]. Most of these models rely on strong assumptions about the noise terms, often assumed to be i.i.d. random variables sampled from a Gaussian distribution, and the guarantees provided in their support are only asymptotic. An alternative non-parametric approach to time series analysis consists of extending the standard i.i.d. statistical learning theory framework to that of stochastic processes.

[…] we consider the general case of non-stationary non-mixing processes. We are not aware of any prior work providing generalization bounds in this setting. In fact, our bounds appear to be novel even when the process is stationary (but not mixing). The learning guarantees that we present hold for both bounded and unbounded memory models. […] Our guarantees cover the majority of approaches used in practice, including various autoregressive and state space models. The key ingredients of our generalization bounds are a data-dependent measure of sequential complexity (expected sequential covering number or sequential Rademacher complexity [Rakhlin et al., 2010]) and a measure of discrepancy between the sample and target distributions. (Kuznetsov and Mohri 2014) also give generalization bounds in terms of discrepancy. However, unlike the result of (Kuznetsov and Mohri 2014), our analysis does not require any mixing assumptions which are hard to verify in practice. More importantly, under some additional mild assumption, the discrepancy measure that we propose can be estimated from data, which leads to data-dependent learning guarantees for non-stationary non-mixing case.
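One way to read the discrepancy measure quoted above: compare the average loss a hypothesis incurs on the recent past against its loss on the whole sample, and take the worst case over the hypothesis class. A toy sketch with a small finite class of constant predictors under squared loss (my own illustration, not the estimator from the papers):

```python
import numpy as np

def empirical_discrepancy(y, hypotheses, recent):
    """Sup over hypotheses of |loss on last `recent` points - loss overall|."""
    def risk(h, ys):
        return np.mean((ys - h) ** 2)  # squared loss of constant predictor h
    return max(abs(risk(h, y[-recent:]) - risk(h, y)) for h in hypotheses)

rng = np.random.default_rng(2)
# A non-stationary series: the mean jumps upward near the end.
y = np.concatenate([rng.normal(0.0, 1.0, 500), rng.normal(2.0, 1.0, 100)])
hs = np.linspace(-1, 3, 9)  # finite class of constant predictions

drift = empirical_discrepancy(y, hs, recent=100)
stationary = empirical_discrepancy(rng.normal(0.0, 1.0, 600), hs, recent=100)
print(drift > stationary)  # the distribution shift inflates the discrepancy
```

When the process is (near-)stationary the discrepancy is small and the bound tightens; under distribution drift it grows, which is exactly the quantity their bounds charge the learner for.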

They still require the time series to have a discrete time index, which can be unsatisfying for continuous-time processes.

They also work with mixing coefficients (e.g. (Kuznetsov and Mohri 2016)), but the last time I saw them speak, they were critical of the whole mixing-coefficient setting.

# Refs

Agarwal, Anish, Muhammad Jehangir Amjad, Devavrat Shah, and Dennis Shen. 2018. “Time Series Analysis via Matrix Estimation,” February. http://arxiv.org/abs/1802.09064.

Alquier, Pierre, Xiaoyin Li, and Olivier Wintenberger. 2013. “Prediction of Time Series by Statistical Learning: General Losses and Fast Rates.” Dependence Modeling 1: 65–93. https://doi.org/10.2478/demo-2013-0004.

Alquier, Pierre, and Olivier Wintenberger. 2012. “Model Selection for Weakly Dependent Time Series Forecasting.” Bernoulli. http://arxiv.org/abs/0902.2924.

Bergmeir, Christoph, Rob J. Hyndman, and Bonsoo Koo. 2015. “A Note on the Validity of Cross-Validation for Evaluating Time Series Prediction.” http://business.monash.edu/econometrics-and-business-statistics/research/publications/ebs/wp10-15.pdf.

Cortes, Corinna, Vitaly Kuznetsov, Mehryar Mohri, and Scott Yang. 2016. “Structured Prediction Theory Based on Factor Graph Complexity.” In Advances in Neural Information Processing Systems 29, edited by D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, 2514–22. Curran Associates, Inc. http://papers.nips.cc/paper/6485-structured-prediction-theory-based-on-factor-graph-complexity.pdf.

Delft, Anne van, and Michael Eichler. 2016. “Locally Stationary Functional Time Series,” February. http://arxiv.org/abs/1602.05125.

Geer, Sara van de. 2002. “On Hoeffding’s Inequality for Dependent Random Variables.” In Empirical Process Techniques for Dependent Data. Birkhäuser.

Gribonval, Rémi, Gilles Blanchard, Nicolas Keriven, and Yann Traonmilin. 2017. “Compressive Statistical Learning with Random Feature Moments,” June. http://arxiv.org/abs/1706.07180.

Hardt, Moritz, Tengyu Ma, and Benjamin Recht. 2016. “Gradient Descent Learns Linear Dynamical Systems,” September. http://arxiv.org/abs/1609.05191.

Hazan, Elad, Karan Singh, and Cyril Zhang. 2017. “Learning Linear Dynamical Systems via Spectral Filtering.” In NIPS. http://arxiv.org/abs/1711.00946.

Karandikar, R. L., and M. Vidyasagar. n.d. “Probably Approximately Correct Learning with Beta-Mixing Input Sequences.” Accessed June 14, 2017. https://www.math.ucsd.edu/~helton/MTNSHISTORY/CONTENTS/2004LEUVEN/CDROM/papers/201.pdf.

Kontorovich, Leonid (Aryeh), Corinna Cortes, and Mehryar Mohri. 2008. “Kernel Methods for Learning Languages.” Theoretical Computer Science, Algorithmic Learning Theory, 405 (3): 223–36. https://doi.org/10.1016/j.tcs.2008.06.037.

Kontorovich, Leonid, Corinna Cortes, and Mehryar Mohri. 2006. “Learning Linearly Separable Languages.” In Algorithmic Learning Theory, edited by José L. Balcázar, Philip M. Long, and Frank Stephan, 288–303. Lecture Notes in Computer Science 4264. Springer Berlin Heidelberg. http://link.springer.com/chapter/10.1007/11894841_24.

Kuznetsov, Vitaly, and Mehryar Mohri. 2014. “Forecasting Non-Stationary Time Series: From Theory to Algorithms.” http://www.cims.nyu.edu/~munoz/multitask/Paper_22_fts.pdf.

———. 2015. “Learning Theory and Algorithms for Forecasting Non-Stationary Time Series.” In Advances in Neural Information Processing Systems, 541–49. Curran Associates, Inc. http://papers.nips.cc/paper/5836-learning-theory-and-algorithms-for-forecasting-non-stationary-time-series.

———. 2016. “Generalization Bounds for Non-Stationary Mixing Processes.” Machine Learning Journal. http://www.cs.nyu.edu/~mohri/pub/nonstatj.pdf.

McDonald, Daniel J., Cosma Rohilla Shalizi, and Mark Schervish. 2011a. “Generalization Error Bounds for Stationary Autoregressive Models,” March. http://arxiv.org/abs/1103.0942.

———. 2011b. “Risk Bounds for Time Series Without Strong Mixing,” June. http://arxiv.org/abs/1106.0730.

Meir, Ron. 2000. “Nonparametric Time Series Prediction Through Adaptive Model Selection.” Machine Learning 39 (1): 5–34. https://doi.org/10.1023/A:1007602715810.

Mohri, Mehryar, and Afshin Rostamizadeh. 2009b. “Stability Bounds for Stationary φ-Mixing and β-Mixing Processes.” Journal of Machine Learning Research 11: 789–814. http://www.jmlr.org/papers/v11/mohri10a.html.

Rakhlin, Alexander, Karthik Sridharan, and Ambuj Tewari. 2014. “Sequential Complexities and Uniform Martingale Laws of Large Numbers.” Probability Theory and Related Fields 161 (1-2): 111–53. https://doi.org/10.1007/s00440-013-0545-5.

Shalizi, Cosma, and Aryeh Kontorovich. 2013. “Predictive PAC Learning and Process Decompositions.” In Advances in Neural Information Processing Systems, 1619–27. http://papers.nips.cc/paper/5102-predictive-pac-learning-and-process-decompositions.

Simchowitz, Max, Horia Mania, Stephen Tu, Michael I. Jordan, and Benjamin Recht. 2018. “Learning Without Mixing: Towards A Sharp Analysis of Linear System Identification,” February. http://arxiv.org/abs/1802.08334.

Steinwart, Ingo, and Andreas Christmann. 2009. “Fast Learning from Non-i.i.d. Observations.” In Advances in Neural Information Processing Systems, 1768–76. http://papers.nips.cc/paper/3736-fast-learning-from-non-iid-observations.

Xu, Aolin, and Maxim Raginsky. 2017. “Information-Theoretic Analysis of Generalization Capability of Learning Algorithms.” In Advances in Neural Information Processing Systems. http://arxiv.org/abs/1705.07809.

Yu, Bin. 1994. “Rates of Convergence for Empirical Processes of Stationary Mixing Sequences.” The Annals of Probability 22 (1): 94–116. https://doi.org/10.1214/aop/1176988849.