In system identification, we infer the parameters of a stochastic dynamical system of a certain type, usually one with feedback, so that we can, e.g., simulate it, or deconvolve it to find the inputs and hidden state, maybe using state filters. In statistical terms, this is the parameter inference problem for dynamical systems.
Moreover, it totally works without Gaussian noise; Gaussianity is just convenient in optimal linear filtering (Kalman filtering isn't rocket science, after all). Also, mathematically, Gaussianity is a useful crutch if you decide to go to a continuous time index, cf. Gaussian processes.
This is the mostly offline version. There is a subnotebook focussing on online recursive estimation.
Intros
Oppenheim and Verghese, Signals, Systems, and Inference is free online.
Martin [Mart99a]:
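Consider the basic autoregressive model,

$$Y_n = \sum_{k=1}^{p} a_k Y_{n-k} + \varepsilon_n$$

[where $\varepsilon_n$ is white noise with variance $\sigma^2$].
Estimating AR(p) coefficients:
The [power] spectrum is easily obtained from [the above] as

$$P(f) = \frac{\sigma^2 \, \Delta t}{\left| 1 - \sum_{k=1}^{p} a_k e^{-2\pi i f k \Delta t} \right|^2}$$

with $\Delta t$ the intersample spacing.[…] for any given set of data, we need to be able to estimate the AR coefficients conveniently. Three methods for achieving this are the Yule-Walker, Burg and Covariance methods. The Yule-Walker technique uses the sample autocovariance to obtain the coefficients; the Covariance method defines, for a set of numbers $a = (a_1, \dots, a_p)$, a quantity known as the total forward and backward prediction error power:

$$E(a) = \frac{1}{2(N-p)} \sum_{n=p+1}^{N} \left( \Bigl| Y_n - \sum_{k=1}^{p} a_k Y_{n-k} \Bigr|^2 + \Bigl| Y_{n-p} - \sum_{k=1}^{p} a_k Y_{n-p+k} \Bigr|^2 \right)$$

and minimises this w.r.t. $a$. As $E$ is a quadratic function of $a$, $\partial E / \partial a$ is linear in $a$ and so this is a linear optimisation problem. The Burg method is a constrained minimisation of $E$ using the Levinson recursion, a computational device derived from the Yule-Walker method.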
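To make the Yule-Walker route concrete, here is a minimal sketch in plain Python. It assumes the convention $y_n = \sum_k a_k y_{n-k} + \varepsilon_n$ with innovation variance $\sigma^2$. The function names (`sample_autocov`, `levinson_durbin`, `ar_spectrum`) and the simulated AR(2) test signal are my own illustrative choices, not anything from Martin's paper.

```python
import cmath
import math
import random


def sample_autocov(y, max_lag):
    """Biased sample autocovariance r[0..max_lag] of a sequence."""
    n = len(y)
    mean = sum(y) / n
    d = [v - mean for v in y]
    return [sum(d[t] * d[t + k] for t in range(n - k)) / n
            for k in range(max_lag + 1)]


def levinson_durbin(r, p):
    """Solve the Yule-Walker equations for AR(p) by the Levinson recursion.

    Returns (a, sigma2) for y[n] = sum_k a[k-1] * y[n-k] + eps[n],
    where sigma2 is the estimated innovation variance.
    """
    a = []
    sigma2 = r[0]
    for m in range(1, p + 1):
        # Reflection coefficient for order m.
        kappa = (r[m] - sum(a[j] * r[m - 1 - j] for j in range(m - 1))) / sigma2
        # Order update of the coefficients, then of the error variance.
        a = [a[j] - kappa * a[m - 2 - j] for j in range(m - 1)] + [kappa]
        sigma2 *= 1.0 - kappa * kappa
    return a, sigma2


def ar_spectrum(a, sigma2, freq, dt=1.0):
    """AR power spectral density at `freq` (cycles per unit time)."""
    transfer = 1.0 - sum(a_k * cmath.exp(-2j * math.pi * freq * (k + 1) * dt)
                         for k, a_k in enumerate(a))
    return sigma2 * dt / abs(transfer) ** 2


# Simulate a stationary AR(2) process (hypothetical test signal) and refit it.
rng = random.Random(0)
true_a = [0.75, -0.5]
y = [0.0, 0.0]
for _ in range(5000):
    y.append(true_a[0] * y[-1] + true_a[1] * y[-2] + rng.gauss(0.0, 1.0))

a_hat, s2_hat = levinson_durbin(sample_autocov(y, 2), 2)

# Evaluate the fitted spectrum from zero up to the Nyquist frequency 1/(2 dt).
freqs = [i / 200 for i in range(101)]
psd = [ar_spectrum(a_hat, s2_hat, f) for f in freqs]
```

The recursion builds the order-$m$ solution from the order-$(m-1)$ one, so fitting all orders up to $p$ costs $O(p^2)$ rather than a fresh linear solve per order. The Burg and Covariance methods would replace the autocovariance step with direct minimisation of prediction error power.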
Instrumental variable regression
Unevenly sampled
Model estimation/system identification
You don't know a parameterised model for the data (and hence a precise bandwidth) and you wish to estimate it.
This is a system identification problem, although the nonuniform sampling means that it has an unusual form.
Martin ([Mart99a]) gives this summary:
One could consider the general problem in an approximate way as the missing data problem with a very high proportion of missing data points, but (Jone81, Jone84) this is not very realistic. This has led to the consideration of the continuous-time model […]. Masry (LiMa92) shows that the coefficients in that equation may be obtained from the [irregularly sampled autocorrelation moments, but], the estimation of these requires a large amount of data and the results are asymptotic in the limit of infinite data. The other continuous-time approach is that of Jones (Jone81, Jone84) who has used Kalman recursive estimation […] to obtain a likelihood function which is then maximised w.r.t. b to obtain an estimate of the true parameters.
There is a partial review and comparison of methods in StSa06, and Broe06. From the latter:
Martin (Mart99b) applied autoregressive modeling to irregularly sampled data using a dedicated method. It was particularly good in extracting sinusoids from noise in short data sets. Söderström and Mossberg (SöMo00) evaluated the performance of methods for identifying continuous-time autoregressive processes, which replace the differentiation operator by different approximations. Larsson and Söderström (LaSö02) apply this idea to randomly sampled autoregressive data. They report promising results for low-order processes. Lahalle et al. (LaFR04) estimate continuous-time ARMA models. Unfortunately, their method requires explicit use of a model for irregular sampling instants. The precise shape of that distribution is very important for the result, but it is almost impossible to establish it from practical data.
No generally satisfactory spectral estimator for irregular data has been defined yet. Continuous time series models can be estimated for irregular data, and they are the only possible candidates for obtaining the Cramér-Rao lower boundary, because the true process for irregular data is a continuous-time process. Jones (Jone81) has formulated the maximum likelihood estimator for irregular observations. However, Jones (Jone84) also found that the likelihood has several local maxima and the optimisation requires extremely good initial estimates. Broersen and Bos (BrBo06) used the method of Jones to obtain maximum likelihood estimates for irregular data. If simulations started with the true process parameters as initial conditions, that was sometimes, but not always, good enough to converge to the global maximum of the likelihood. However, sometimes even those perfect and non-realisable starting values were not capable of letting the likelihood converge to an acceptable model. So far, no practical maximum likelihood method for irregular data has solved all numerical problems, and certainly no satisfactory realisable initial conditions can be given. As an example, it has been verified in simulations that taking the estimated AR(p−1) model together with an additional zero for order p as starting values for AR(p) estimation does not always converge to acceptable AR(p) models. The model with the maximum value of the likelihood might not in all cases be accurate and many good models have significantly lower numerical values of the likelihood. Martin ([Mart99a]) suggests that the exact likelihood is sensitive to round-off errors. Broersen and Bos (BrBo06) calculated the likelihood as a function of true model parameters, multiplied by a constant factor. Only the likelihood for a single pole was smooth. Two poles already gave a number of sharp peaks in the likelihood, and three or more poles gave a very rough surface of the likelihood.
The scene is full of local minima, and the optimisation cannot find the global minimum, unless it starts very close to it.
Slotting
Asymptotic methods based on gridding observations.
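As a rough illustration of what gridding means here, the following sketch computes a slotted autocovariance: every pair of observations contributes to the lag bin nearest its time separation. This is a minimal version under my own assumptions (sorted observation times, equal-width slots centred on multiples of `slot_width`, a simulated Ornstein-Uhlenbeck test signal); the function name `slotted_autocov` is hypothetical, and the published slotting estimators differ in details such as normalisation and kernel weighting.

```python
import math
import random


def slotted_autocov(times, values, slot_width, n_slots):
    """Slotted autocovariance estimate for irregularly sampled data.

    Each pair of observations contributes the product of its mean-removed
    values to the slot whose centre k * slot_width is nearest to the pair's
    time separation. `times` must be sorted in increasing order.
    Returns (estimates, counts); empty slots give None.
    """
    mean = sum(values) / len(values)
    d = [v - mean for v in values]
    sums = [0.0] * n_slots
    counts = [0] * n_slots
    n = len(times)
    for i in range(n):
        for j in range(i, n):
            k = int(round((times[j] - times[i]) / slot_width))
            if k >= n_slots:
                break  # times are sorted, so later j are even further away
            sums[k] += d[i] * d[j]
            counts[k] += 1
    est = [s / c if c else None for s, c in zip(sums, counts)]
    return est, counts


# Hypothetical test signal: an Ornstein-Uhlenbeck process with decay rate 1
# and stationary variance 0.5, observed at exponentially distributed gaps.
rng = random.Random(0)
stat_var = 0.5
t, x = 0.0, rng.gauss(0.0, math.sqrt(stat_var))
times, values = [], []
for _ in range(2000):
    gap = rng.expovariate(2.0)  # mean gap of 0.5 time units
    phi = math.exp(-gap)
    x = phi * x + math.sqrt(stat_var * (1.0 - phi * phi)) * rng.gauss(0.0, 1.0)
    t += gap
    times.append(t)
    values.append(x)

est, counts = slotted_autocov(times, values, slot_width=0.5, n_slots=6)
```

The estimates should decay roughly like $0.5\,e^{-\tau}$ across slots. The slot width trades bias (wide slots smear the autocovariance) against variance (narrow slots hold few pairs), which is one reason these methods are only asymptotically justified.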
Method of transformed coefficients
Useful tool: the equivalence of a continuous-time Itô integral and a discrete ARIMA process (attributed by Mart98 to Bart46), which also implies you can estimate the model without estimating missing data. That is satisfying, although the precise form this takes is less so.
Popular overviews seem to be PiPe04 and [Mart99b]
State filters
(Note that you can also do the signal reconstruction problem using state filters, but I'm interested here in doing system identification using state filters.) Jones (Jone81, Jone84) gave this a go; while Mart99a mentioned problems, I'm curious when it does work, since this seems natural, simple, and easier to make robust against model violations than the other methods.
It is well known that if a univariate continuous time autoregression is sampled at equally spaced time intervals, the resulting discrete time process is ARMA(p, p−1). If the sampling includes observational error, the resulting process is ARMA(p,p); however, these 2p parameters depend only on the p continuous time autoregression coefficients and the observational error variance. Modeling the process as a continuous time autoregression with observational error may be much more parsimonious than modeling the discrete time process, whether or not the data are equally spaced. The direct modeling of observational error has the effect of smoothing noisy data and may eliminate the need for moving average terms.
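To see the state-filter idea in its simplest form, here is a sketch of the Kalman-filter likelihood for a first-order continuous-time autoregression, $dx = -a\,x\,dt + \sigma\,dW$, observed with additive Gaussian error at irregular times. This is only the $p = 1$ case and is not Jones's algorithm as published; the function name `car1_loglik`, the simulated data and the crude grid search are all my own illustrative assumptions.

```python
import math
import random


def car1_loglik(times, obs, a, sigma, obs_var):
    """Gaussian log-likelihood of noisy, irregularly timed observations of
    the continuous-time AR(1) process dx = -a x dt + sigma dW (a > 0),
    computed with a scalar Kalman filter."""
    stat_var = sigma ** 2 / (2.0 * a)   # stationary variance of x
    mean, var = 0.0, stat_var           # prior for the first observation
    loglik = 0.0
    prev_t = times[0]
    for t, y in zip(times, obs):
        # Predict across the gap; the exact transition depends on its length.
        phi = math.exp(-a * (t - prev_t))
        mean = phi * mean
        var = phi * phi * var + stat_var * (1.0 - phi * phi)
        # The innovation and its variance give this observation's term.
        innov = y - mean
        innov_var = var + obs_var
        loglik -= 0.5 * (math.log(2.0 * math.pi * innov_var)
                         + innov * innov / innov_var)
        # Measurement update.
        gain = var / innov_var
        mean += gain * innov
        var *= 1.0 - gain
        prev_t = t
    return loglik


# Simulate irregular noisy observations (hypothetical parameter values).
rng = random.Random(1)
a_true, sigma_true, obs_var_true = 1.0, 1.0, 0.01
stat_var_true = sigma_true ** 2 / (2.0 * a_true)
t, x = 0.0, rng.gauss(0.0, math.sqrt(stat_var_true))
times, obs = [], []
for _ in range(500):
    gap = rng.expovariate(2.0)
    phi = math.exp(-a_true * gap)
    x = phi * x + math.sqrt(stat_var_true * (1.0 - phi * phi)) * rng.gauss(0.0, 1.0)
    t += gap
    times.append(t)
    obs.append(x + math.sqrt(obs_var_true) * rng.gauss(0.0, 1.0))

# Crude maximum likelihood over a grid of decay rates, other parameters known.
grid = [0.25, 0.5, 1.0, 2.0, 4.0, 50.0]
logliks = [car1_loglik(times, obs, a, sigma_true, obs_var_true) for a in grid]
a_best = grid[logliks.index(max(logliks))]
```

No missing data need be imputed: the gap length enters only through the transition `phi`, so uneven sampling costs nothing extra. With one pole the likelihood is well behaved; the rough likelihood surfaces reported for higher orders are not something this toy case exhibits.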
Online
See recursive estimation.
Misc
Gradient descent learns Linear Dynamical systems
Linear Predictive Coding
LPC introductions traditionally start with a physical model of the human vocal tract as a resonating pipe, then mumble away the details. This confused the hell out of me. AFAICT, an LPC model is just a list of AR regression coefficients and a driving noise source coefficient. This is “coding” because you can round the numbers, pack them down a smidgen and then use it to encode certain time series, such as the human voice, compactly. But it's still a regression analysis, and can be treated as such.
The twists are that

we usually think about it in a compression context

traditionally one performs many regressions to get time-varying models
It's commonly described as a physical model because we can imagine these regression coefficients corresponding to a simplified physical model of the human vocal tract. But we can think of the regression coefficients as corresponding to any all-pole linear system, so I don't think that brings special insight, especially as a model of, say, a resonating pipe would intuitively be described by time delays corresponding to the length of the pipe, not time lags set by the sampling interval plus computational convenience. Sure, we can get a similar spectral response from this model as from a pipe, according to linear systems theory, but if you are going to assume so much advanced linear systems theory anyway, and mix it with crappy physics, why not just start with the linear systems and ditch the physics?
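The regression view of LPC can be sketched in a few lines: chop the signal into frames, fit an AR model per frame, and round the numbers. The sketch below uses Burg's method for the per-frame fit, which is one common choice rather than the only one. The names `burg` and `lpc_encode`, the quantisation step and the test signal are my own assumptions, and real codecs usually quantise a transformed parameterisation instead of the raw coefficients.

```python
import random


def burg(x, p):
    """Burg's method: AR(p) coefficients and innovation variance.

    Returns (a, noise_var) for x[n] = sum_k a[k-1] * x[n-k] + eps[n].
    """
    n = len(x)
    f = list(x)  # forward prediction errors
    b = list(x)  # backward prediction errors
    a = []
    noise_var = sum(v * v for v in x) / n
    for m in range(p):
        num = sum(f[t] * b[t - 1] for t in range(m + 1, n))
        den = sum(f[t] ** 2 + b[t - 1] ** 2 for t in range(m + 1, n))
        k = 2.0 * num / den if den > 0 else 0.0  # reflection coefficient
        new_f, new_b = f[:], b[:]
        for t in range(m + 1, n):
            new_f[t] = f[t] - k * b[t - 1]
            new_b[t] = b[t - 1] - k * f[t]
        f, b = new_f, new_b
        # Levinson-style order update of the AR coefficients.
        a = [a[j] - k * a[m - 1 - j] for j in range(m)] + [k]
        noise_var *= 1.0 - k * k
    return a, noise_var


def lpc_encode(signal, frame_len, order, step=1 / 64):
    """Fit one AR model per frame and round its coefficients to a grid."""
    frames = []
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        a, noise_var = burg(signal[start:start + frame_len], order)
        frames.append(([round(c / step) * step for c in a], noise_var))
    return frames


# Hypothetical test signal whose dynamics switch halfway through.
rng = random.Random(2)
signal, prev = [], 0.0
for n in range(2000):
    coeff = 0.9 if n < 1000 else -0.9
    prev = coeff * prev + rng.gauss(0.0, 1.0)
    signal.append(prev)

code = lpc_encode(signal, frame_len=500, order=2)
```

Each frame is summarised by `order` rounded coefficients plus a noise variance, which is the compression; the frame-by-frame refit is what makes the model time-varying. Nothing in it refers to pipes.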
To discuss: these coefficients as spectrogram smoothing.
Refs
 BeZa76: (1976) A Comparison Between Wiener Filtering, Kalman Filtering, and Deterministic Least Squares Estimation. Geophysical Prospecting, 24(1), 141–197.
 Mare07: (2007) A Functional Analysis Approach to Subband System Approximation and Identification. IEEE Transactions on Signal Processing, 55(2), 493–506.
 HeDG15: (2015) A New View of Predictive State Methods for Dynamical System Learning. ArXiv:1505.05310 [Cs, Stat].
 AnWa16: (2016) A Nonparametric Model for Stationary Time Series. Journal of Time Series Analysis, 37(1), 126–142.
 PSCP16: (2016) A Survey of Stochastic Simulation and Optimization Methods in Signal Processing. IEEE Journal of Selected Topics in Signal Processing, 10(2), 224–241.
 KMBT11: (2011) Adaptive algorithms for sparse system identification. Signal Processing, 91(8), 1910–1919.
 UnTa14: (2014) An introduction to sparse stochastic processes. New York: Cambridge University Press
 IvAZ10: (2010) Analysis of ecological time series with ARMA(p,q) models. Ecology, 91(3), 858–871.
 TaKa00: (2000) Asymptotic theory of statistical inference for time series. New York: Springer
 Broe06: (2006) Automatic autocorrelation and spectral analysis. Secaucus, NJ, USA: Springer Science & Business Media
 Mart98: (1998) Autoregression and irregular sampling: Filtering. Signal Processing, 69(3), 229–248.
 Mart99: (1999) Autoregression and irregular sampling: Spectral estimation. Signal Processing, 77(2), 139–157.
 BrWB04: (2004) Autoregressive spectral analysis when observations are missing. Automatica, 40(9), 1495–1504.
 BüKü99: (1999) Block length selection in the bootstrap for time series. Computational Statistics & Data Analysis, 31(3), 295–310.
 Carm14: (2014) Compressive System Identification. In Compressed Sensing & Sparse Filtering (pp. 281–324). Springer Berlin Heidelberg
 Carm13: (2013) Compressive system identification: sequential methods and entropy bounds. Digital Signal Processing, 23(3), 751–770.
 ZhMc06: (2006) Computer Algebra Derivation of the Bias of Burg Estimators. Journal of Time Series Analysis, 27(2), 157–165.
 LaFR04: (2004) Continuous ARMA spectral estimation from irregularly sampled observations. In Proceedings of the 21st IEEE Instrumentation and Measurement Technology Conference, 2004. IMTC 04 (Vol. 2, pp. 923–927).
 Vand12: (2012) Convex optimization techniques in system identification. IFAC Proceedings Volumes, 45(16), 71–76.
 DoJR13: (2013) Derivative-Free Estimation of the Score Vector and Observed Information Matrix with Application to State-Space Models. ArXiv:1304.5768 [Stat].
 Küns86: (1986) Discrimination between monotonic trends and long-range dependence. Journal of Applied Probability, 23(4), 1025–1030.
 ChKF16: (2016) Distributed and parallel time series feature extraction for industrial big data applications. ArXiv:1610.07717 [Cs].
 BoPi70: (1970) Distribution of Residual Autocorrelations in Autoregressive-Integrated Moving Average Time Series Models. Journal of the American Statistical Association, 65(332), 1509–1526.
 GeMe81: (1981) Estimating regression models of finite but unknown order. Journal of Econometrics, 16(1), 162.
 BrBo06: (2006) Estimating time-series models from irregularly spaced data. In IEEE Transactions on Instrumentation and Measurement (Vol. 55, pp. 1124–1131).
 TuKu82: (1982) Estimation of frequencies of multiple sinusoids: Making linear prediction perform like maximum likelihood. Proceedings of the IEEE, 70(9), 975–989.
 Paga74: (1974) Estimation of Models of Autoregressive Signal Plus White Noise. The Annals of Statistics, 2(1), 99–108.
 McZh08: (2008) Faster ARMA maximum likelihood estimation. Computational Statistics & Data Analysis, 52(4), 2166–2176.
 Jone81: (1981) Fitting a continuous time autoregression to discrete data. In Applied time series analysis II (pp. 651–682).
 Jone84: (1984) Fitting multivariate models to unequally spaced data. In Time series analysis of irregularly observed data (pp. 158–188). Springer
 Kay93: (1993) Fundamentals of statistical signal processing. Englewood Cliffs, N.J: Prentice-Hall PTR
 McSS11a: (2011a) Generalization error bounds for stationary autoregressive models. ArXiv:1103.0942 [Cs, Stat].
 Werb88: (1988) Generalization of backpropagation with application to a recurrent gas market model. Neural Networks, 1(4), 339–356.
 HaMR16: (2016) Gradient Descent Learns Linear Dynamical Systems. ArXiv:1609.05191 [Cs, Math, Stat].
 Mcle98: (1998) Hyperbolic decay time series. Journal of Time Series Analysis, 19(4), 473–483.
 LaSö02: (2002) Identification of continuous-time AR processes from unevenly sampled data. Automatica, 38(4), 709–718.
 MiVi93: (1993) Information-Based Complexity and Nonparametric Worst-Case System Identification. Journal of Complexity, 9(4), 427–446.
 XuRa17: (2017) Information-theoretic analysis of generalization capability of learning algorithms. In Advances In Neural Information Processing Systems.
 MaKP98: (1998) James-Stein state filtering algorithms. IEEE Transactions on Signal Processing, 46(9), 2431–2447.
 HaSZ17: (2017) Learning Linear Dynamical Systems via Spectral Filtering. In NIPS.
 SMTJ18: (2018) Learning Without Mixing: Towards A Sharp Analysis of Linear System Identification. ArXiv:1802.08334 [Cs, Math, Stat].
 KaSH00: (2000) Linear estimation. Upper Saddle River, N.J: Prentice Hall
 Makh75: (1975) Linear prediction: A tutorial review. Proceedings of the IEEE, 63(4), 561–580.
 KaMo00: (2000) Matrices with banded inverses: inversion algorithms and factorization of Gauss-Markov processes. IEEE Transactions on Information Theory, 46(4), 1495–1509.
 Akai73: (1973) Maximum likelihood identification of Gaussian autoregressive moving average models. Biometrika, 60(2), 255–265.
 PlDY15: (2015) Mesochronal Structure Learning. Uncertainty in Artificial Intelligence : Proceedings of the … Conference. Conference on Uncertainty in Artificial Intelligence, 31.
 LiMa92: (1992) Model fitting for continuous-time stationary processes from discrete-time data. Journal of Multivariate Analysis, 41(1), 56–79.
 DuKo97: (1997) Monte Carlo maximum likelihood estimation for non-Gaussian state space models. Biometrika, 84(3), 669–684.
 HeGo15: (2015) Noncausal Autoregressive Model in Application to Bitcoin/USD Exchange Rates. In Econometrics of Risk (pp. 17–40). Springer International Publishing
 Geer02: (2002) On Hoeffding’s inequality for dependent random variables. In Empirical Process Techniques for Dependent Data. Birkhäuser
 Bart46: (1946) On the Theoretical Specification and Sampling Properties of Autocorrelated Time-Series. Supplement to the Journal of the Royal Statistical Society, 8(1), 27–41.
 SöMo00: (2000) Performance evaluation of methods for identifying continuous-time autoregressive processes. Automatica, 36(1), 53–59.
 McSS11b: (2011b) Risk bounds for time series without strong mixing. ArXiv:1106.0730 [Cs, Stat].
 KeCh72: (1972) Signal detection and extraction by cepstrum techniques. IEEE Transactions on Information Theory, 18(6), 745–759.
 StMo05: (2005) Spectral Analysis of Signals. Upper Saddle River, N.J: Prentice Hall
 HaKo05: (2005) Structural Time Series Models. In Encyclopedia of Biostatistics. John Wiley & Sons, Ltd
 Scar81: (1981) Studies in astronomical time series analysis. I. Modeling random processes in the time domain. The Astrophysical Journal Supplement Series, 45, 1–71.
 Ljun99: (1999) System identification: theory for the user. Upper Saddle River, NJ: Prentice Hall PTR
 ChHo12: (2012) Testing for the Markov Property in Time Series. Econometric Theory, 28(1), 130–178.
 MaFe07: (2007) Testing the Markov property with high frequency data. Journal of Econometrics, 141(1), 44–64.
 RaZa52: (1952) The analysis of sampled-data systems. Transactions of the American Institute of Electrical Engineers, Part II: Applications and Industry, 71(5), 225–234.
 HoLD10: (2010) The ARMA alphabet soup: A tour of ARMA model variants. Statistics Surveys, 4, 232–274.
 Atal06: (2006) The history of linear prediction. IEEE Signal Processing Magazine, 23(2), 154–161.
 Pill16: (2016) The interplay between system identification and machine learning. ArXiv:1612.09158 [Cs, Stat].
 LjSö83: (1983) Theory and practice of recursive identification. Cambridge, Mass: MIT Press
 DuKo12: (2012) Time series analysis by state space methods. Oxford: Oxford University Press
 BJRL16: (2016) Time series analysis: forecasting and control. Hoboken, New Jersey: John Wiley & Sons, Inc
 AASS18: (2018) Time Series Analysis via Matrix Estimation. ArXiv:1802.09064 [Cs, Stat].