Gradient descent, a classic first-order optimisation method, with many variants, and many things one might wish to understand.
There are only a few things I wish to understand for the moment.
Coordinate descent: descend along each coordinate individually.
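For concreteness, a toy sketch of cyclic coordinate descent on a separable quadratic (the objective, step size, and names here are all my own illustration, not from any cited paper):

```python
import numpy as np

def coordinate_descent(grad, x0, lr=0.1, n_iter=100):
    """Cycle through coordinates, updating one at a time.

    `grad` returns the full gradient; only one component is used per step.
    """
    x = x0.astype(float)
    for t in range(n_iter):
        i = t % len(x)               # cyclic choice of coordinate
        x[i] -= lr * grad(x)[i]      # descend along that coordinate only
    return x

# minimise f(x) = x[0]**2 + 2 * x[1]**2
x = coordinate_descent(lambda x: np.array([2 * x[0], 4 * x[1]]),
                       np.array([3.0, -2.0]))
```

For genuinely separable problems each coordinate step is cheap, which is the whole appeal (cf. the Nesterov and Wright coordinate-descent references below).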
A small clever hack for certain domains: log gradient descent.
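My reading of the hack: if the parameters must stay positive, descend on u = log(x) instead of x. By the chain rule df/du = x · df/dx, which gives a multiplicative update that can never leave the positive orthant. A sketch under that assumption (the objective and step size are illustrative):

```python
import numpy as np

def log_gradient_step(x, grad, lr=0.1):
    """One 'log gradient descent' step: gradient descent on u = log(x),
    i.e. x <- x * exp(-lr * x * grad(x)), which preserves x > 0."""
    return x * np.exp(-lr * x * grad(x))

# minimise f(x) = (x - 2)**2 over x > 0
x = np.array([5.0])
for _ in range(200):
    x = log_gradient_step(x, lambda x: 2 * (x - 2))
```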
Zeyuan Allen-Zhu, Faster Than SGD 1: Variance Reduction:

SGD is well-known for large-scale optimization. In my mind, there are two (and only two) fundamental improvements since the original introduction of SGD: (1) variance reduction, and (2) acceleration. In this post I’d love to conduct a survey regarding (1).
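A minimal sketch of one variance-reduction scheme, SVRG-style (my own toy version, with illustrative names and step sizes; see the variance-reduction references below for the real thing):

```python
import numpy as np

def svrg(grads, x0, lr=0.05, n_epochs=20, m=50):
    """SVRG sketch: each epoch fixes a snapshot x_ref and its full
    gradient mu; inner steps use g_i(x) - g_i(x_ref) + mu, an unbiased
    gradient estimate whose variance shrinks as x approaches x_ref.
    `grads` is a list of per-datum gradient functions."""
    rng = np.random.default_rng(0)
    n = len(grads)
    x = x0.astype(float)
    for _ in range(n_epochs):
        x_ref = x.copy()
        mu = sum(g(x_ref) for g in grads) / n   # full gradient at snapshot
        for _ in range(m):
            i = rng.integers(n)
            x = x - lr * (grads[i](x) - grads[i](x_ref) + mu)
    return x

# average of f_i(x) = (x - a_i)**2 is minimised at the mean of the a_i
data = [1.0, 2.0, 3.0, 4.0]
grads = [lambda x, a=a: 2 * (x - a) for a in data]
x = svrg(grads, np.array([0.0]))
```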
How and when does it work, and how well? Moritz Hardt, in The zen of gradient descent, explains it through Chebyshev polynomials. Sébastien Bubeck explains it from a different angle in Revisiting Nesterov’s Acceleration, expanding upon the rather magical introduction given in his lectures. Wibisono et al explain it in terms of a variational perspective.
Yellowfin, an automatic SGD momentum tuner.
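The mechanics of Nesterov-style acceleration are easy to sketch even if the analysis is magical: evaluate the gradient at a “lookahead” point rather than at the current iterate. A toy version (my own; the test function and hyperparameters are illustrative):

```python
import numpy as np

def nesterov(grad, x0, lr=0.01, momentum=0.9, n_iter=500):
    """Nesterov's accelerated gradient, in the common momentum form:
    take the gradient at the lookahead point x + momentum * v."""
    x = x0.astype(float)
    v = np.zeros_like(x)
    for _ in range(n_iter):
        v = momentum * v - lr * grad(x + momentum * v)
        x = x + v
    return x

# badly conditioned quadratic: f(x) = 0.5 * (x[0]**2 + 100 * x[1]**2)
x = nesterov(lambda x: np.array([x[0], 100 * x[1]]),
             np.array([1.0, 1.0]))
```

On ill-conditioned quadratics like this, the momentum term is what lets the slow coordinate make progress without the fast coordinate blowing up.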
Mini-batch and stochastic methods are for minimising a loss when you have so much data, or so many parameters, that using it all at once is silly, or when you want to improve your solution iteratively as data comes in; you need access to a gradient for your loss, ideally automatically calculated. It is not at all obvious that this should work better than collating all your data and optimising offline, except that much of modern machine learning demonstrates that it does.
Sometimes this apparently stupid trick is even fast in small-dimensional cases, so you may as well try.
Technically, “online” optimisation in bandit/RL problems may mean that you have to “minimise regret online”, which has a slightly different meaning: you see each training example only as it arrives along some notional arrow of time, yet wish to make the “best” decision at the next time, possibly choosing your next experiment to trade off exploration against exploitation, etc.
In SGD you can see your data as often as you want and in whatever order, but you only look at a bit at a time. Usually the data is given and predictions make no difference to what information is available to you.
Some of the same technology pops up in each of these notions of online optimisation, but I am really thinking about SGD here.
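The SGD version I mean is easy to write down. A plain minibatch loop for least squares (a toy sketch of my own; the data and hyperparameters are arbitrary):

```python
import numpy as np

def sgd_minibatch(X, y, lr=0.05, batch_size=8, n_epochs=200, seed=0):
    """Minibatch SGD for least squares: each step uses the gradient of
    the loss on a random minibatch rather than the full data set."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    n = len(y)
    for _ in range(n_epochs):
        # a fresh random pass over the data, one minibatch at a time
        for idx in np.split(rng.permutation(n), n // batch_size):
            Xb, yb = X[idx], y[idx]
            w -= lr * 2 * Xb.T @ (Xb @ w - yb) / len(idx)
    return w

rng = np.random.default_rng(1)
X = rng.normal(size=(64, 3))
w_true = np.array([1.0, -2.0, 0.5])
w = sgd_minibatch(X, X @ w_true)
```

With noiseless data like this, every minibatch gradient vanishes at the true solution, so SGD converges to it rather than merely hovering nearby.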
There are many more permutations and variations used in practice.
Conditional gradient, a.k.a. the Frank-Wolfe algorithm: don’t know much about this.
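The basic recipe is at least simple to sketch: replace projection with a linear minimisation oracle over the constraint set. A toy version over the ℓ1 ball (all names and parameters here are my own illustration; see Jagg13 for a proper treatment):

```python
import numpy as np

def frank_wolfe(grad, lmo, x0, n_iter=200):
    """Frank-Wolfe / conditional gradient sketch: call a linear
    minimisation oracle `lmo` over the constraint set and step toward
    its answer, with the classic 2/(t+2) step size."""
    x = x0.astype(float)
    for t in range(n_iter):
        s = lmo(grad(x))                 # argmin_{s in C} <grad, s>
        x = x + 2.0 / (t + 2) * (s - x)  # stay inside C by convexity
    return x

# minimise ||x - b||^2 over the l1 ball; the oracle returns the signed
# vertex of the ball aligned against the gradient
b = np.array([0.3, -0.2])
lmo = lambda g: -np.sign(g) * (np.arange(2) == np.argmax(np.abs(g)))
x = frank_wolfe(lambda x: 2 * (x - b), lmo, np.zeros(2))
```

Note the iterate is always a convex combination of vertices, which is why Frank-Wolfe is popular for sparse solutions.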
- RoSi71: H. Robbins, D. Siegmund (1971) A convergence theorem for non negative almost supermartingales and some applications. In Optimizing Methods in Statistics (pp. 233–257). Academic Press DOI
- YuTo09: Sangwoon Yun, Kim-Chuan Toh (2009) A coordinate gradient descent method for ℓ 1-regularized convex minimization. Computational Optimization and Applications, 48(2), 273–307. DOI
- BeTe09: Amir Beck, Marc Teboulle (2009) A Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse Problems. SIAM Journal on Imaging Sciences, 2(1), 183–202. DOI
- Rupp85: David Ruppert (1985) A Newton-Raphson Version of the Multivariate Robbins-Monro Procedure. The Annals of Statistics, 13(1), 236–245. DOI
- STXY16: Hao-Jun Michael Shi, Shenyinying Tu, Yangyang Xu, Wotao Yin (2016) A Primer on Coordinate Descent Algorithms. ArXiv:1610.00040 [Math, Stat].
- CoPe08: Patrick L. Combettes, Jean-Christophe Pesquet (2008) A proximal decomposition method for solving convex variational inverse problems. Inverse Problems, 24(6), 065014. DOI
- FlPo63: R. Fletcher, M. J. D. Powell (1963) A Rapidly Convergent Descent Method for Minimization. The Computer Journal, 6(2), 163–168. DOI
- ACDL14: Alekh Agarwal, Olivier Chapelle, Miroslav Dudık, John Langford (2014) A Reliable Effective Terascale Linear Learning System. Journal of Machine Learning Research, 15(1), 1111–1133.
- RoMo51: Herbert Robbins, Sutton Monro (1951) A Stochastic Approximation Method. The Annals of Mathematical Statistics, 22(3), 400–407. DOI
- LuMH15: Aurelien Lucchi, Brian McWilliams, Thomas Hofmann (2015) A Variance Reduced Stochastic Newton Method. ArXiv:1503.08316 [Cs].
- WiWJ16: Andre Wibisono, Ashia C. Wilson, Michael I. Jordan (2016) A variational perspective on accelerated methods in optimization. Proceedings of the National Academy of Sciences, 113(47), E7351–E7358. DOI
- GhLa13a: Saeed Ghadimi, Guanghui Lan (2013a) Accelerated Gradient Methods for Nonconvex Nonlinear and Stochastic Programming. ArXiv:1310.3787 [Math].
- HuPK09: Chonghai Hu, Weike Pan, James T. Kwok (2009) Accelerated gradient methods for stochastic optimization and online learning. In Advances in Neural Information Processing Systems (pp. 781–789). Curran Associates, Inc.
- VSSM06: S.V. N. Vishwanathan, Nicol N. Schraudolph, Mark W. Schmidt, Kevin P. Murphy (2006) Accelerated Training of Conditional Random Fields with Stochastic Gradient Methods. In Proceedings of the 23rd International Conference on Machine Learning.
- Nest07: Yu Nesterov (2007) Accelerating the cubic regularization of Newton’s method on convex problems. Mathematical Programming, 112(1), 159–181. DOI
- PoJu92: B. T. Polyak, A. B. Juditsky (1992) Acceleration of Stochastic Approximation by Averaging. SIAM Journal on Control and Optimization, 30(4), 838–855. DOI
- MHSY13: H. Brendan McMahan, Gary Holt, D. Sculley, Michael Young, Dietmar Ebner, Julian Grady, … Jeremy Kubica (2013) Ad Click Prediction: A View from the Trenches. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 1222–1230). New York, NY, USA: ACM DOI
- KiBa15: Diederik Kingma, Jimmy Ba (2015) Adam: A Method for Stochastic Optimization. Proceedings of ICLR.
- Spal00: J. C. Spall (2000) Adaptive stochastic approximation by the simultaneous perturbation method. IEEE Transactions on Automatic Control, 45(10), 1839–1853. DOI
- DuHS11: John Duchi, Elad Hazan, Yoram Singer (2011) Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. Journal of Machine Learning Research, 12(Jul), 2121–2159.
- Chre08: Stephane Chretien (2008) An Alternating l1 approach to the compressed sensing problem. ArXiv:0809.0660 [Stat].
- Rude16: Sebastian Ruder (2016) An overview of gradient descent optimization algorithms. ArXiv:1609.04747 [Cs].
- MZHR16: Ioannis Mitliagkas, Ce Zhang, Stefan Hadjis, Christopher Ré (2016) Asynchrony begets Momentum, with an Application to Deep Learning. ArXiv:1605.09774 [Cs, Math, Stat].
- HaLS15: Elad Hazan, Kfir Levy, Shai Shalev-Shwartz (2015) Beyond Convexity: Stochastic Quasi-Convex Optimization. In Advances in Neural Information Processing Systems 28 (pp. 1594–1602). Curran Associates, Inc.
- Gile08: Mike B. Giles (2008) Collected Matrix Derivative Results for Forward and Reverse Mode Algorithmic Differentiation. In Advances in Automatic Differentiation (Vol. 64, pp. 35–44). Berlin, Heidelberg: Springer Berlin Heidelberg
- BiCY14: Wei Bian, Xiaojun Chen, Yinyu Ye (2014) Complexity analysis of interior point algorithms for non-Lipschitz and nonconvex minimization. Mathematical Programming, 149(1–2), 301–327. DOI
- CGWY12: Xiaojun Chen, Dongdong Ge, Zizhuo Wang, Yinyu Ye (2012) Complexity of unconstrained L_2-L_p minimization. Mathematical Programming, 143(1–2), 371–383. DOI
- HaJN15: Zaid Harchaoui, Anatoli Juditsky, Arkadi Nemirovski (2015) Conditional gradient algorithms for norm-regularized smooth convex optimization. Mathematical Programming, 152(1–2), 75–112. DOI
- JaFM14: D. Jakovetic, J.M. Freitas Xavier, J.M.F. Moura (2014) Convergence Rates of Distributed Nesterov-Like Gradient Methods on Random Networks. IEEE Transactions on Signal Processing, 62(4), 868–882. DOI
- BoVa04: Stephen P. Boyd, Lieven Vandenberghe (2004) Convex optimization. Cambridge, UK ; New York: Cambridge University Press
- Bube15: Sébastien Bubeck (2015) Convex Optimization: Algorithms and Complexity. Foundations and Trends® in Machine Learning, 8(3–4), 231–357. DOI
- CeBS14: Volkan Cevher, Stephen Becker, Mark Schmidt (2014) Convex Optimization for Big Data. IEEE Signal Processing Magazine, 31(5), 32–43. DOI
- Bach13: Francis Bach (2013) Convex relaxations of structured matrix factorizations. ArXiv:1309.3117 [Cs, Math].
- WuLa08: Tong Tong Wu, Kenneth Lange (2008) Coordinate descent algorithms for lasso penalized regression. The Annals of Applied Statistics, 2(1), 224–244. DOI
- Boyd10: Stephen Boyd (2010) Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers. Foundations and Trends® in Machine Learning, 3(1), 1–122. DOI
- MKJS15: Chenxin Ma, Jakub Konečnỳ, Martin Jaggi, Virginia Smith, Michael I. Jordan, Peter Richtárik, Martin Takáč (2015) Distributed Optimization with Arbitrary Local Solvers. ArXiv Preprint ArXiv:1512.04039.
- MaBe17: Siyuan Ma, Mikhail Belkin (2017) Diving into the shallows: a computational perspective on large-scale shallow learning. ArXiv:1703.10622 [Cs, Stat].
- Nest12a: Y. Nesterov (2012a) Efficiency of Coordinate Descent Methods on Huge-Scale Optimization Problems. SIAM Journal on Optimization, 22(2), 341–362. DOI
- SGAL14: Levent Sagun, V. Ugur Guney, Gerard Ben Arous, Yann LeCun (2014) Explorations on high dimensional landscapes. ArXiv:1412.6615 [Cs, Stat].
- GoSB15: Tom Goldstein, Christoph Studer, Richard Baraniuk (2015) FASTA: A Generalized Implementation of Forward-Backward Splitting. ArXiv:1501.04979 [Cs, Math].
- AbHa15: Jacob Abernethy, Elad Hazan (2015) Faster Convex Optimization: Simulated Annealing with an Efficient Universal Barrier. ArXiv:1507.02528 [Cs, Math].
- MEKU08: Doug Mcleod, Garry Emmerson, Robert Kohn, Geoff Kingston (2008) Finding the invisible hand: an objective model of financial markets
- Batt92: Roberto Battiti (1992) First-and second-order methods for learning: between steepest descent and Newton’s method. Neural Computation, 4(2), 141–166. DOI
- Dala17: Arnak S. Dalalyan (2017) Further and stronger analogy between sampling and optimization: Langevin Monte Carlo and gradient descent. ArXiv:1704.04752 [Math, Stat].
- Nest12b: Yu Nesterov (2012b) Gradient methods for minimizing composite functions. Mathematical Programming, 140(1), 125–161. DOI
- KiHa16: Daeun Kim, Justin P. Haldar (2016) Greedy algorithms for nonnegativity-constrained simultaneous sparse recovery. Signal Processing, 125, 274–289. DOI
- FrSc12: Michael P. Friedlander, Mark Schmidt (2012) Hybrid Deterministic-Stochastic Methods for Data Fitting. SIAM Journal on Scientific Computing, 34(3), A1380–A1405. DOI
- DPGC14: Yann Dauphin, Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Surya Ganguli, Yoshua Bengio (2014) Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In Advances in Neural Information Processing Systems 27 (pp. 2933–2941). Curran Associates, Inc.
- YaMM15: Jiyan Yang, Xiangrui Meng, Michael W. Mahoney (2015) Implementing Randomized Matrix Algorithms in Parallel and Distributed Environments. ArXiv:1502.03032 [Cs, Math, Stat].
- BoLl15: Zdravko I. Botev, Chris J. Lloyd (2015) Importance accelerated Robbins-Monro recursion with applications to parametric confidence limits. Electronic Journal of Statistics, 9(2), 2058–2075. DOI
- Mair15: J. Mairal (2015) Incremental Majorization-Minimization Optimization with Application to Large-Scale Machine Learning. SIAM Journal on Optimization, 25(2), 829–855. DOI
- Nest04: Yurii Nesterov (2004) Introductory Lectures on Convex Optimization (Vol. 87). Boston, MA: Springer US
- StVi16: Damian Straszak, Nisheeth K. Vishnoi (2016) IRLS and Slime Mold: Equivalence and Convergence. ArXiv:1601.02712 [Cs, Math, Stat].
- WiNa16: David Wipf, Srikantan Nagarajan (2016) Iterative Reweighted l1 and l2 Methods for Finding Sparse Solution. Microsoft Research.
- ChYi08: R. Chartrand, Wotao Yin (2008) Iteratively reweighted algorithms for compressive sensing. In IEEE International Conference on Acoustics, Speech and Signal Processing, 2008. ICASSP 2008 (pp. 3869–3872). DOI
- SFJJ15: Virginia Smith, Simone Forte, Michael I. Jordan, Martin Jaggi (2015) L1-Regularized Distributed Optimization: A Communication-Efficient Primal-Dual Framework. ArXiv:1512.04011 [Cs].
- Klei04: Dan Klein (2004) Lagrange multipliers without permanent scarring. University of California at Berkeley, Computer Science Division.
- BoLe04: Léon Bottou, Yann LeCun (2004) Large Scale Online Learning. In Advances in Neural Information Processing Systems 16. Cambridge, MA: MIT Press
- Bott10: Léon Bottou (2010) Large-scale machine learning with stochastic gradient descent. In Proceedings of the 19th International Conference on Computational Statistics (COMPSTAT’2010) (pp. 177–186). Paris, France: Springer
- HoSr15: Reshad Hosseini, Suvrit Sra (2015) Manifold Optimization for Gaussian Mixture Models. ArXiv Preprint ArXiv:1506.07677.
- BMAS14: Nicolas Boumal, Bamdev Mishra, P.-A. Absil, Rodolphe Sepulchre (2014) Manopt, a Matlab Toolbox for Optimization on Manifolds. Journal of Machine Learning Research, 15, 1455–1459.
- MaNT04: K Madsen, H.B. Nielsen, O. Tingleff (2004) Methods for non-linear least squares problems
- BoLB16: Aleksandar Botev, Guy Lever, David Barber (2016) Nesterov’s Accelerated Gradient and Momentum as approximations to Regularised Update Descent. ArXiv:1607.01981 [Cs, Stat].
- HiSK00: Geoffrey Hinton, Nitish Srivastava, Kevin Swersky (n.d.) Neural Networks for Machine Learning
- BaMo11: Francis Bach, Eric Moulines (2011) Non-Asymptotic Analysis of Stochastic Approximation Algorithms for Machine Learning. In Advances in Neural Information Processing Systems (NIPS). Spain
- Devo98: Ronald A. DeVore (1998) Nonlinear approximation. Acta Numerica, 7, 51–150. DOI
- BaMo13: Francis R. Bach, Eric Moulines (2013) Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n). In arXiv:1306.2119 [cs, math, stat] (pp. 773–781).
- WiWi15: Andre Wibisono, Ashia C. Wilson (2015) On Accelerated Methods in Optimization. ArXiv:1509.03616 [Math].
- Heyd74: C. C. Heyde (1974) On martingale limit theory and strong convergence results for stochastic approximation procedures. Stochastic Processes and Their Applications, 2(4), 359–370. DOI
- Pate17: Vivak Patel (2017) On SGD’s Failure in Practice: Characterizing and Overcoming Stalling. ArXiv:1702.00317 [Cs, Math, Stat].
- Gold65: A. Goldstein (1965) On Steepest Descent. Journal of the Society for Industrial and Applied Mathematics Series A Control, 3(1), 147–151. DOI
- Bott98: Léon Bottou (1998) Online Algorithms and Stochastic Approximations. In Online Learning and Neural Networks (Vol. 17, p. 142). Cambridge, UK: Cambridge University Press
- Zink03: Martin Zinkevich (2003) Online Convex Programming and Generalized Infinitesimal Gradient Ascent. In Proceedings of the Twentieth International Conference on International Conference on Machine Learning (pp. 928–935). Washington, DC, USA: AAAI Press
- AlHa16: Zeyuan Allen-Zhu, Elad Hazan (2016) Optimal Black-Box Reductions Between Optimization Objectives. In Advances in Neural Information Processing Systems 29 (pp. 1606–1614). Curran Associates, Inc.
- BePo05: D. Bertsimas, I. Popescu (2005) Optimal Inequalities in Probability Theory: A Convex Optimization Approach. SIAM Journal on Optimization, 15(3), 780–804. DOI
- ScFR09: Mark Schmidt, Glenn Fung, Romer Rosales (2009) Optimization methods for l1-regularization. University of British Columbia, Technical Report TR-2009, 19.
- BoCN16: Léon Bottou, Frank E. Curtis, Jorge Nocedal (2016) Optimization Methods for Large-Scale Machine Learning. ArXiv:1606.04838 [Cs, Math, Stat].
- Mair13a: Julien Mairal (2013a) Optimization with First-Order Surrogate Functions. In International Conference on Machine Learning (pp. 783–791).
- BJMO12: Francis Bach, Rodolphe Jenatton, Julien Mairal, Guillaume Obozinski (2012) Optimization with Sparsity-Inducing Penalties. Foundations and Trends® in Machine Learning, 4(1), 1–106. DOI
- ZWLS10: Martin Zinkevich, Markus Weimer, Lihong Li, Alex J. Smola (2010) Parallelized Stochastic Gradient Descent. In Advances in Neural Information Processing Systems 23 (pp. 2595–2603). Curran Associates, Inc.
- FHHT07: Jerome Friedman, Trevor Hastie, Holger Höfling, Robert Tibshirani (2007) Pathwise coordinate optimization. The Annals of Applied Statistics, 1(2), 302–332. DOI
- RoZh07: Saharon Rosset, Ji Zhu (2007) Piecewise linear regularized solution paths. The Annals of Statistics, 35(3), 1012–1030. DOI
- PaBo14: Neal Parikh, Stephen Boyd (2014) Proximal Algorithms. Foundations and Trends® in Optimization, 1(3), 127–239. DOI
- ToKW16: James Townsend, Niklas Koep, Sebastian Weichwald (2016) Pymanopt: A Python Toolbox for Optimization on Manifolds using Automatic Differentiation. Journal of Machine Learning Research, 17(137), 1–5.
- LiMH16: Hongzhou Lin, Julien Mairal, Zaid Harchaoui (2016) QuickeNing: A Generic Quasi-Newton Algorithm for Faster Gradient-Based Optimization *. In arXiv:1610.00960 [math, stat].
- MaBo10: J. Mattingley, S. Boyd (2010) Real-Time Convex Optimization in Signal Processing. IEEE Signal Processing Magazine, 27(3), 50–61. DOI
- FoKe09: Alexander I. J. Forrester, Andy J. Keane (2009) Recent advances in surrogate-based optimization. Progress in Aerospace Sciences, 45(1–3), 50–79. DOI
- GaRC09: G. Gasso, A. Rakotomamonjy, S. Canu (2009) Recovering Sparse Signals With a Certain Family of Nonconvex Penalties and DC Programming. IEEE Transactions on Signal Processing, 57(12), 4686–4698. DOI
- LiLR16: Yuanzhi Li, Yingyu Liang, Andrej Risteski (2016) Recovery Guarantee of Non-negative Matrix Factorization via Alternating Updates. In Advances in Neural Information Processing Systems 29 (pp. 4988–4996). Curran Associates, Inc.
- SFHT11: Noah Simon, Jerome Friedman, Trevor Hastie, Rob Tibshirani (2011) Regularization Paths for Cox’s Proportional Hazards Model via Coordinate Descent. Journal of Statistical Software, 39(5).
- FrHT10: Jerome Friedman, Trevor Hastie, Rob Tibshirani (2010) Regularization Paths for Generalized Linear Models via Coordinate Descent. Journal of Statistical Software, 33(1), 1–22. DOI
- Jagg13: Martin Jaggi (2013) Revisiting Frank-Wolfe: Projection-Free Sparse Convex Optimization. In Journal of Machine Learning Research (pp. 427–435).
- AcNP05: Dimitris Achlioptas, Assaf Naor, Yuval Peres (2005) Rigorous location of phase transitions in hard optimization problems. Nature, 435(7043), 759–764. DOI
- DeBL14: Aaron Defazio, Francis Bach, Simon Lacoste-Julien (2014) SAGA: A Fast Incremental Gradient Method With Support for Non-Strongly Convex Composite Objectives. In Advances in Neural Information Processing Systems 27.
- AgBH16: Naman Agarwal, Brian Bullins, Elad Hazan (2016) Second Order Stochastic Optimization in Linear Time. ArXiv:1602.03943 [Cs, Stat].
- BoBG09: Antoine Bordes, Léon Bottou, Patrick Gallinari (2009) SGD-QN: Careful Quasi-Newton Stochastic Gradient Descent. Journal of Machine Learning Research, 10, 1737–1754.
- Chen12: Xiaojun Chen (2012) Smoothing methods for nonsmooth, nonconvex minimization. Mathematical Programming, 134(1), 71–99. DOI
- ZYJZ15: Lijun Zhang, Tianbao Yang, Rong Jin, Zhi-Hua Zhou (2015) Sparse Learning for Large-scale and High-dimensional Data: A Randomized Convex-concave Optimization Approach. ArXiv:1511.03766 [Cs].
- MaBP14: Julien Mairal, Francis Bach, Jean Ponce (2014) Sparse modeling for image and vision processing. Foundations and Trends® in Comput Graph. Vis., 8(2–3), 85–283. DOI
- LaLZ09: John Langford, Lihong Li, Tong Zhang (2009) Sparse Online Learning via Truncated Gradient. In Advances in Neural Information Processing Systems 21 (pp. 905–912). Curran Associates, Inc.
- WrNF09: S. J. Wright, R. D. Nowak, M. A. T. Figueiredo (2009) Sparse Reconstruction by Separable Approximation. IEEE Transactions on Signal Processing, 57(7), 2479–2493. DOI
- Lai03: Tze Leung Lai (2003) Stochastic Approximation. The Annals of Statistics, 31(2), 391–406. DOI
- GhLa13b: Saeed Ghadimi, Guanghui Lan (2013b) Stochastic First- and Zeroth-order Methods for Nonconvex Stochastic Programming. SIAM Journal on Optimization, 23(4), 2341–2368. DOI
- SSBA16: Sashank J. Reddi, Suvrit Sra, Barnabás Póczós, Alex Smola (2016) Stochastic Frank-Wolfe Methods for Nonconvex Optimization.
- Frie02: Jerome H. Friedman (2002) Stochastic gradient boosting. Computational Statistics & Data Analysis, 38(4), 367–378. DOI
- Bott12: Léon Bottou (2012) Stochastic Gradient Descent Tricks. In Neural Networks: Tricks of the Trade (pp. 421–436). Springer, Berlin, Heidelberg DOI
- Bott91: Léon Bottou (1991) Stochastic Gradient Learning in Neural Networks. In Proceedings of Neuro-Nîmes 91. Nimes, France: EC2
- Mair13b: Julien Mairal (2013b) Stochastic majorization-minimization algorithms for large-scale optimization. In Advances in Neural Information Processing Systems (pp. 2283–2291).
- ShTe11: Shai Shalev-Shwartz, Ambuj Tewari (2011) Stochastic Methods for L1-regularized Loss Minimization. Journal of Machine Learning Research, 12, 1865–1892.
- NLST17: Lam M. Nguyen, Jie Liu, Katya Scheinberg, Martin Takáč (2017) Stochastic Recursive Gradient Algorithm for Nonconvex Optimization. ArXiv:1705.07261 [Cs, Math, Stat].
- RHSP16: Sashank J. Reddi, Ahmed Hefny, Suvrit Sra, Barnabas Poczos, Alex Smola (2016) Stochastic Variance Reduction for Nonconvex Optimization. In PMLR (Vol. 1603, pp. 314–323).
- ZhWG17: Xiao Zhang, Lingxiao Wang, Quanquan Gu (2017) Stochastic Variance-reduced Gradient Descent for Low-rank Matrix Recovery from Linear Measurements. ArXiv:1701.00481 [Stat].
- Wain14: Martin J. Wainwright (2014) Structured Regularizers for High-Dimensional Problems: Statistical and Computational Issues. Annual Review of Statistics and Its Application, 1(1), 233–253. DOI
- QHSG05: Nestor V. Queipo, Raphael T. Haftka, Wei Shyy, Tushar Goel, Rajkumar Vaidyanathan, P. Kevin Tucker (2005) Surrogate-based analysis and optimization. Progress in Aerospace Sciences, 41(1), 1–28. DOI
- HaZh12: Zhong-Hua Han, Ke-Shi Zhang (2012) Surrogate-based optimization.
- AABB16: Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, … Matthieu Devin (2016) TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems. ArXiv Preprint ArXiv:1603.04467.
- BuEl14: Sébastien Bubeck, Ronen Eldan (2014) The entropic barrier: a simple and optimal universal self-concordant barrier. ArXiv:1412.1587 [Cs, Math].
- PoKo97: Stephen Portnoy, Roger Koenker (1997) The Gaussian hare and the Laplacian tortoise: computability of squared-error versus absolute-error estimators. Statistical Science, 12(4), 279–300. DOI
- This97: Ronald A. Thisted (1997) [The Gaussian Hare and the Laplacian Tortoise: Computability of Squared-Error versus Absolute-Error Estimators]: Comment. Statistical Science, 12(4), 296–298.
- MeBM16: Song Mei, Yu Bai, Andrea Montanari (2016) The Landscape of Empirical Risk for Non-convex Losses. ArXiv:1607.06534 [Stat].
- CHMB15: Anna Choromanska, MIkael Henaff, Michael Mathieu, Gerard Ben Arous, Yann LeCun (2015) The Loss Surfaces of Multilayer Networks. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics (pp. 192–204).
- Levy16: Kfir Y. Levy (2016) The Power of Normalization: Faster Evasion of Saddle Points. ArXiv:1611.04831 [Cs, Math, Stat].
- BoBo08: Léon Bottou, Olivier Bousquet (2008) The Tradeoffs of Large Scale Learning. In Advances in Neural Information Processing Systems (Vol. 20, pp. 161–168). NIPS Foundation (http://books.nips.cc)
- Xu11: Wei Xu (2011) Towards Optimal One Pass Large Scale Learning with Averaged Stochastic Gradient Descent. ArXiv:1107.2490 [Cs].
- SaKi16: Tim Salimans, Diederik P Kingma (2016) Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks. In Advances in Neural Information Processing Systems 29 (pp. 901–901). Curran Associates, Inc.