Using the machinery of linear regression to predict in somewhat more general regressions, using least-squares or quasi-likelihood approaches. This means you are still doing something like maximum likelihood regression, but outside the setting of homoskedastic Gaussian noise and linear response.
Classic linear models
Consider the original linear model. We have a (column) vector $\mathbf{y}$ of $n$ observations and an $n\times p$ matrix $\mathbf{X}$ of covariates, where each column corresponds to a different covariate and each row to a different observation.
We assume the observations are related to the covariates by
$$\mathbf{y}=\mathbf{X}\mathbf{b}+\mathbf{e}$$
where $\mathbf{b}$ gives the parameters of the model, which we don't yet know. We call $\mathbf{e}$ the “residual” vector. Legendre and Gauss pioneered the estimation of the parameters of a linear model by minimising the squared residuals, $\mathbf{e}^\top\mathbf{e}$, i.e.
$$\hat{\mathbf{b}}=\operatorname{arg\,min}_{\mathbf{b}}(\mathbf{y}-\mathbf{X}\mathbf{b})^\top(\mathbf{y}-\mathbf{X}\mathbf{b})=\mathbf{X}^{+}\mathbf{y}$$
where we find the pseudo-inverse $\mathbf{X}^{+}$ using a numerical solver of some kind, using one of the many carefully optimised methods that exist for least squares.
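A minimal sketch of the pseudo-inverse solve described above, on synthetic data. The names `X`, `y`, `b_true` and all the numbers are made up for illustration; NumPy is assumed as the numerical library.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: n observations, p covariates (all values are invented).
n, p = 200, 3
X = rng.normal(size=(n, p))
b_true = np.array([1.5, -2.0, 0.5])
y = X @ b_true + 0.1 * rng.normal(size=n)

# Least-squares estimate via the Moore-Penrose pseudo-inverse.
b_pinv = np.linalg.pinv(X) @ y

# The same estimate from a dedicated least-squares solver,
# which does not form the pseudo-inverse explicitly.
b_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(b_pinv)
print(b_lstsq)
```

In practice one would usually call the solver rather than form the pseudo-inverse; the two are shown side by side only to make the correspondence visible.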
So far there is no statistical argument, merely function approximation.
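However, it turns out that if you assume that the residuals are i.i.d. random errors in the observations (or at least independent with constant variance), then there is also a statistical justification for this idea: with Gaussian errors the least-squares estimate is also the maximum likelihood estimate, and under the weaker assumption of zero-mean, independent, constant-variance errors the Gauss-Markov theorem makes it the best linear unbiased estimator.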
TODO: more exposition of these. Linkage to Maximum likelihood.
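As a sketch of the linkage to maximum likelihood, assume (notation chosen here for illustration) observations $\mathbf{y}$, covariate matrix $\mathbf{X}$, parameters $\mathbf{b}$, and residuals that are i.i.d. Gaussian with variance $\sigma^2$. The log-likelihood is then

```latex
\ell(\mathbf{b},\sigma^2)
  = -\frac{n}{2}\log(2\pi\sigma^2)
    - \frac{1}{2\sigma^2}
      (\mathbf{y}-\mathbf{X}\mathbf{b})^\top(\mathbf{y}-\mathbf{X}\mathbf{b})
```

For any fixed $\sigma^2$, the first term does not depend on $\mathbf{b}$, so maximising the likelihood over $\mathbf{b}$ is the same as minimising the sum of squared residuals.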
Generalised linear models
The original extension. TODO: explain.
To learn:

When can we do this? e.g. Must the response be from an exponential family for really real? What happens if not?

Does anything funky happen with regularisation? What?

When you combine all these fancy GLM extensions, how the hell do you work out if your parameters are identifiable?

Non-monotonic relations between predictors: how does one handle these?

Model selection?
Response distribution
TBD. What constraints do we have here?
Linear Predictor
Link function
An invertible (monotonic?) function relating the linear predictor to the mean of the response distribution.
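To make the three ingredients concrete, here is a minimal sketch assuming a Poisson response distribution, a linear predictor `X @ b`, and a log link (so the inverse link is `exp`). These choices, the function name `fit_poisson_glm`, and the synthetic data are all assumptions for illustration. The fit uses iteratively reweighted least squares, which reuses the least-squares machinery at each step.

```python
import numpy as np

def fit_poisson_glm(X, y, n_iter=25, tol=1e-10):
    """Fit a Poisson GLM with log link by iteratively reweighted least squares."""
    b = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ b                 # linear predictor
        mu = np.exp(eta)            # inverse link gives the response mean
        # Working response and weights for log link with Poisson variance.
        z = eta + (y - mu) / mu
        w = mu
        # Weighted least-squares step: solve (X' W X) b = X' W z.
        XtW = X.T * w
        b_new = np.linalg.solve(XtW @ X, XtW @ z)
        converged = np.max(np.abs(b_new - b)) < tol
        b = b_new
        if converged:
            break
    return b

# Synthetic count data (all values invented for the example).
rng = np.random.default_rng(1)
n = 2000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
b_true = np.array([0.5, 0.3])
y = rng.poisson(np.exp(X @ b_true))

b_hat = fit_poisson_glm(X, y)
print(b_hat)
```

Swapping the response distribution or the link changes only the lines computing `mu`, `z` and `w`; the weighted least-squares step stays the same.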
Quasi-likelihood
A generalisation of likelihood, of use in some tricky corners of GLMs. Wedd74 used it to provide a unified GLM/ML rationale.
I don't yet understand it.
Heyde says (Heyd97):
Historically there are two principal themes in statistical parameter estimation theory
It is now possible to unify these approaches under the general description of quasi-likelihood and to develop the theory of parameter estimation in a very general setting. […]
It turns out that the theory needs to be developed in terms of estimating functions (functions of both the data and the parameter) rather than the estimators themselves. Thus, our focus will be on functions that have the value of the parameter as a root rather than the parameter itself.
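A minimal sketch of the estimating-function viewpoint, under assumptions chosen here for illustration rather than taken from Heyd97 or Wedd74: a log link, and a variance proportional to the mean, with an unknown dispersion `phi`. Only the mean and variance function are specified, not a full distribution. The estimate is the root of the quasi-score, found here by Fisher scoring. The function names and the synthetic overdispersed counts are hypothetical.

```python
import numpy as np

def quasi_score(b, X, y):
    """Quasi-score U(b) = D' V^{-1} (y - mu) for log link and V(mu) = mu."""
    mu = np.exp(X @ b)
    D = X * mu[:, None]             # derivative of mu_i with respect to b_j
    return D.T @ ((y - mu) / mu)

def solve_quasi_score(X, y, n_iter=50, tol=1e-10):
    """Find the root of the quasi-score by Fisher scoring."""
    b = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = np.exp(X @ b)
        D = X * mu[:, None]
        info = D.T @ (D / mu[:, None])          # D' V^{-1} D
        step = np.linalg.solve(info, quasi_score(b, X, y))
        b = b + step
        if np.max(np.abs(step)) < tol:
            break
    return b

# Overdispersed counts with Var(y) = phi * E(y), built from a negative
# binomial purely to generate test data; the fit never uses that fact.
rng = np.random.default_rng(2)
n, phi = 5000, 3.0
X = np.column_stack([np.ones(n), rng.normal(size=n)])
b_true = np.array([0.5, 0.3])
mu_true = np.exp(X @ b_true)
y = rng.negative_binomial(mu_true / (phi - 1.0), 1.0 / phi)

b_hat = solve_quasi_score(X, y)
mu_hat = np.exp(X @ b_hat)
# Pearson estimate of the dispersion.
phi_hat = np.sum((y - mu_hat) ** 2 / mu_hat) / (n - X.shape[1])
print(b_hat, phi_hat)
```

The focus is on the function `quasi_score`, which has the parameter as its root, rather than on a closed-form estimator.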
Hierarchical generalised linear models
GLM + hierarchical model = HGLM.
Generalised additive models
Generalised generalised linear models.
Semiparametric simultaneous discovery of some nonlinear predictors and their response curve under the assumption that the interaction is additive in the transformed predictors.
These have now also been generalised in the obvious way.
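A rough sketch of the additive idea, restricted to the simplest case of a Gaussian response with identity link, fitted by backfitting. The kernel smoother, its bandwidth, the function names and the synthetic data are all assumptions for illustration; real GAM software uses penalised splines and wraps a loop like this inside a GLM-style reweighting scheme.

```python
import numpy as np

def kernel_smooth(x, r, bandwidth=0.2):
    """Nadaraya-Watson smoother: smooth the values r as a function of x."""
    d = (x[:, None] - x[None, :]) / bandwidth
    w = np.exp(-0.5 * d ** 2)
    return (w @ r) / w.sum(axis=1)

def backfit(X, y, n_iter=20):
    """Fit y ~ alpha + sum_j f_j(X[:, j]) by backfitting."""
    n, p = X.shape
    alpha = y.mean()
    f = np.zeros((n, p))
    for _ in range(n_iter):
        for j in range(p):
            # Partial residual: remove every component except the j-th.
            partial = y - alpha - f.sum(axis=1) + f[:, j]
            f[:, j] = kernel_smooth(X[:, j], partial)
            f[:, j] -= f[:, j].mean()   # centre each component for identifiability
    return alpha, f

# Synthetic data with two nonlinear, additive effects (invented values).
rng = np.random.default_rng(3)
n = 400
X = rng.uniform(-2, 2, size=(n, 2))
y = np.sin(2 * X[:, 0]) + X[:, 1] ** 2 + 0.1 * rng.normal(size=n)

alpha, f = backfit(X, y)
fitted = alpha + f.sum(axis=1)
mse = np.mean((y - fitted) ** 2)
print(mse)
```

Each column of `f` holds one estimated response curve evaluated at the observed predictor values.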
Generalised additive models for location, scale and shape
Folding GARCH and other regression models into GAMs.
GAMLSS is a modern distribution-based approach to (semiparametric) regression models, where all the parameters of the assumed distribution for the response can be modelled as additive functions of the explanatory variables.
Generalised hierarchical additive models for location, scale and shape
Exercise for the student.
Refs
BrCl93: (1993) Approximate Inference in Generalized Linear Mixed Models. Journal of the American Statistical Association, 88(421), 9–25.
FiSi16: (2016) Approximate Smoothing and Parameter Estimation in High-Dimensional State-Space Models. ArXiv:1606.08650 [Stat].
XiWJ14: (2014) Asymptotic properties of maximum quasi-likelihood estimator in quasi-likelihood nonlinear models with misspecified variance function. Statistics, 48(4), 778–786.
SGVV13: (2013) Estimating VaR and ES of the spot price of oil using futures-varying centiles. International Journal of Financial Engineering and Risk Management, 1(1), 6–19.
Wood08: (2008) Fast stable direct fitting and smoothness selection for generalized additive models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(3), 495–518.
DRGV00: (n.d.) Flexible Regression and Smoothing: Using GAMLSS in R.
HaTi90: (1990) Generalized additive models (Vol. 43). CRC Press.
MFHK12: (2012) Generalized additive models for location, scale and shape for high dimensional data: a flexible approach based on boosting. Journal of the Royal Statistical Society: Series C (Applied Statistics), 61(3), 403–427.
StRO07: (2007) Generalized additive models for location scale and shape (GAMLSS) in R. Journal of Statistical Software, 23(7), 1–46.
CuDE06: (2006) Generalized linear array models with applications to multidimensional smoothing. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(2), 259–280.
BBCG09: (2009) Generalized linear mixed models: a practical guide for ecology and evolution. Trends in Ecology & Evolution, 24(3), 127–135.
NeWe72: (1972) Generalized Linear Models. Journal of the Royal Statistical Society. Series A (General), 135(3), 370–384.
Mccu84: (1984) Generalized linear models. European Journal of Operational Research, 16(3), 285–292.
NeBa04: (2004) Generalized Linear Models. In Encyclopedia of Statistical Sciences. John Wiley & Sons, Inc.
LeNP06: (2006) Generalized linear models with random effects. Boca Raton, FL: Chapman & Hall/CRC.
VeDi04: (2004) GLMs, GAMs and GLMMs: an overview of theory for applications in fisheries research. Fisheries Research, 70(2–3), 319–337.
EiDD16: (2016) Graphical Modeling for Multivariate Hawkes Processes with Nonparametric Link Functions. Journal of Time Series Analysis, n/a-n/a.
ThAH15: (2015) LASSO with Non-linear Measurements is Equivalent to One With Linear Measurements. In Advances in Neural Information Processing Systems 28 (pp. 3402–3410). Curran Associates, Inc.
BuHT89: (1989) Linear Smoothers and Additive Models. The Annals of Statistics, 17(2), 453–510.
Wedd76: (1976) On the existence and uniqueness of the maximum likelihood estimates for certain generalized linear models. Biometrika, 63(1), 27–32.
Hans10: (2010) Penalized maximum likelihood estimation for generalized linear point processes. ArXiv:1003.0848 [Math, Stat].
BKMM17: (2017) Phase Transitions, Optimal Errors and Optimality of Message-Passing in Generalized Linear Models. ArXiv:1708.03395 [Cond-Mat, Physics:Math-Ph].
Heyd97: (1997) Quasi-likelihood and its application: a general approach to optimal parameter estimation. New York: Springer.
Wedd74: (1974) Quasi-likelihood functions, generalized linear models, and the Gauss-Newton method. Biometrika, 61(3), 439–447.
FrHT10: (2010) Regularization Paths for Generalized Linear Models via Coordinate Descent. Journal of Statistical Software, 33(1), 1–22.
BHBR16: (2016) Saturating Splines and Feature Selection. ArXiv:1609.06764 [Stat].
Atal06: (2006) The history of linear prediction. IEEE Signal Processing Magazine, 23(2), 154–161.