
AutoML

hyperparameter selection using yet more hyperparameters

The sub-field of optimisation that aims to automate model selection in machine learning (and occasionally also ensemble construction).

Quoc Le and Barret Zoph weigh in for Google:

Typically, our machine learning models are painstakingly designed by a team of engineers and scientists. This process of manually designing machine learning models is difficult because the search space of all possible models can be combinatorially large — a typical 10-layer network can have ~10¹⁰ candidate networks! […]

To make this process of designing machine learning models much more accessible, we’ve been exploring ways to automate the design of machine learning models. […] in this blog post, we’ll focus on our reinforcement learning approach and the early results we’ve gotten so far.

In our approach (which we call “AutoML”), a controller neural net can propose a “child” model architecture, which can then be trained and evaluated for quality on a particular task. That feedback is then used to inform the controller how to improve its proposals for the next round.
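To make the controller/child loop concrete, here is a toy sketch of that idea — emphatically not Google's actual system. The "architecture" is reduced to a single depth choice, the child's validation accuracy is simulated rather than trained, and the controller is a categorical distribution updated by plain REINFORCE; every constant here is an illustrative assumption:

    import numpy as np

    rng = np.random.default_rng(0)
    logits = np.zeros(8)  # controller: a distribution over child depths 1..8

    def child_reward(n_layers):
        # Stand-in for "train the child model, measure validation accuracy";
        # here depth 5 is (arbitrarily) best, plus a little noise.
        return 1.0 - 0.1 * abs(n_layers - 5) + 0.01 * rng.standard_normal()

    baseline = 0.0
    for step in range(500):
        probs = np.exp(logits) / np.exp(logits).sum()
        a = rng.choice(len(probs), p=probs)   # controller proposes a child
        r = child_reward(a + 1)               # "train and evaluate" the child
        baseline = 0.9 * baseline + 0.1 * r   # moving-average reward baseline
        grad_logp = -probs
        grad_logp[a] += 1.0                   # gradient of log p(a) w.r.t. logits
        logits += 0.1 * (r - baseline) * grad_logp  # REINFORCE update
    print("controller's favourite depth:", np.argmax(logits) + 1)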

Should you bother getting fancy about this? Ben Recht argues no: random search is competitive with highly tuned Bayesian methods in hyperparameter tuning. Let's ignore him for a moment, though, and sniff the hype.
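Random search itself needs almost no machinery. A minimal sketch with scikit-learn's RandomizedSearchCV, where the model, search space, and budget are all arbitrary choices for illustration:

    from scipy.stats import randint, uniform
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import RandomizedSearchCV

    X, y = make_classification(n_samples=500, random_state=0)
    search = RandomizedSearchCV(
        RandomForestClassifier(random_state=0),
        param_distributions={
            "n_estimators": randint(10, 200),
            "max_depth": randint(2, 20),
            "max_features": uniform(0.1, 0.9),  # fraction of features per split
        },
        n_iter=30,   # budget: 30 random draws from the search space
        cv=3,
        random_state=0,
    )
    search.fit(X, y)
    print(search.best_params_, search.best_score_)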

Differentiable hyperparameter optimisation

[Figure: images/hyperparameter_diff.png]

Hyperparameter optimization by gradient descent. Each meta-iteration runs an entire training run of stochastic gradient descent to optimize elementary parameters (weights 1 and 2). Gradients of the validation loss with respect to hyperparameters are then computed by propagating gradients back through the elementary training iterations. Hyperparameters (in this case, learning rate and momentum schedules) are then updated in the direction of this hypergradient. (MaDA15)

From MaDA15:

The last remaining parameter to SGD is the initial parameter vector. Treating this vector as a hyperparameter blurs the distinction between learning and meta-learning. In the extreme case where all elementary learning rates are set to zero, the training set ceases to matter and the meta-learning procedure exactly reduces to elementary learning on the validation set. Due to philosophical vertigo, we chose not to optimize the initial parameter vector.

Their implementation, hypergrad, is cool, but no longer maintained.
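The trick is easy to replicate in miniature with a modern autodiff library. A minimal sketch in JAX (not hypergrad itself): toy quadratic training and validation losses, SGD unrolled for 50 steps, and the validation loss differentiated with respect to the learning rate. All the constants are illustrative assumptions:

    import jax
    import jax.numpy as jnp

    def train_loss(w):
        return jnp.sum((w - 1.0) ** 2)   # toy "training" objective

    def val_loss(w):
        return jnp.sum((w - 1.2) ** 2)   # toy "validation" objective

    def sgd_then_validate(lr, w0, steps=50):
        w = w0
        for _ in range(steps):           # the unrolled elementary training run
            w = w - lr * jax.grad(train_loss)(w)
        return val_loss(w)

    # Reverse-mode gradient of the validation loss w.r.t. the learning rate,
    # propagated back through every elementary training iteration.
    hypergrad_fn = jax.grad(sgd_then_validate)

    lr, w0 = 0.01, jnp.zeros(3)
    for _ in range(10):                  # meta-iterations on the hyperparameter
        lr = lr - 1e-3 * hypergrad_fn(lr, w0)
    print(lr, sgd_then_validate(lr, w0))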

Bayesian/surrogate optimisation

See Bayesian optimisation
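For the shape of the thing, a minimal sketch with scikit-optimize's gp_minimize, where the objective is a cheap placeholder for an expensive train-and-validate run:

    from skopt import gp_minimize

    def objective(params):
        (lr,) = params
        # Placeholder for "train a model with this learning rate,
        # return its validation loss".
        return (lr - 0.1) ** 2

    result = gp_minimize(
        objective,
        dimensions=[(1e-4, 1.0)],  # bounds for the single hyperparameter
        n_calls=20,                # total (expensive) evaluations allowed
        random_state=0,
    )
    print(result.x, result.fun)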

Implementations

Refs

Beng00
Bengio, Y. (2000) Gradient-Based Optimization of Hyperparameters. Neural Computation, 12(8), 1889–1900. DOI.
BBBK11
Bergstra, J. S., Bardenet, R., Bengio, Y., & Kégl, B. (2011) Algorithms for hyper-parameter optimization. In Advances in Neural Information Processing Systems (pp. 2546–2554). Curran Associates, Inc.
FKES15
Feurer, M., Klein, A., Eggensperger, K., Springenberg, J., Blum, M., & Hutter, F. (2015) Efficient and Robust Automated Machine Learning. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, & R. Garnett (Eds.), Advances in Neural Information Processing Systems 28 (pp. 2962–2970). Curran Associates, Inc.
FLFL16
Fu, J., Luo, H., Feng, J., Low, K. H., & Chua, T.-S. (2016) DrMAD: Distilling Reverse-Mode Automatic Differentiation for Optimizing Hyperparameters of Deep Neural Networks. ArXiv:1601.00917 [Cs].
GeSA14
Gelbart, M. A., Snoek, J., & Adams, R. P. (2014) Bayesian Optimization with Unknown Constraints. In Proceedings of the Thirtieth Conference on Uncertainty in Artificial Intelligence (pp. 250–259). Arlington, Virginia, United States: AUAI Press
GAOS10
Grünewälder, S., Audibert, J.-Y., Opper, M., & Shawe-Taylor, J. (2010) Regret Bounds for Gaussian Process Bandit Problems. (Vol. 9, pp. 273–280). Presented at the AISTATS 2010 - Thirteenth International Conference on Artificial Intelligence and Statistics
HuHL11
Hutter, F., Hoos, H. H., & Leyton-Brown, K. (2011) Sequential Model-Based Optimization for General Algorithm Configuration. In Learning and Intelligent Optimization (pp. 507–523). Springer, Berlin, Heidelberg DOI.
HuHL13
Hutter, F., Hoos, H., & Leyton-Brown, K. (2013) An Evaluation of Sequential Model-based Optimization for Expensive Blackbox Functions. In Proceedings of the 15th Annual Conference Companion on Genetic and Evolutionary Computation (pp. 1209–1216). New York, NY, USA: ACM DOI.
IaMS00
Dewancker, I., McCourt, M., & Clark, S. (n.d.) Bayesian Optimization Primer.
LJDR16
Li, L., Jamieson, K., DeSalvo, G., Rostamizadeh, A., & Talwalkar, A. (2016) Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization. ArXiv:1603.06560 [Cs, Stat].
Macl16
Maclaurin, D. (2016) Modeling, Inference and Optimization with Composable Differentiable Procedures.
MaDA15
Maclaurin, D., Duvenaud, D. K., & Adams, R. P. (2015) Gradient-based Hyperparameter Optimization through Reversible Learning. In ICML (pp. 2113–2122).
Močk75
Močkus, J. (1975) On Bayesian Methods for Seeking the Extremum. In G. I. Marchuk (Ed.), Optimization Techniques IFIP Technical Conference (pp. 400–404). Springer Berlin Heidelberg DOI.
SnLA12
Snoek, J., Larochelle, H., & Adams, R. P. (2012) Practical Bayesian Optimization of Machine Learning Algorithms. In Advances in Neural Information Processing Systems (pp. 2951–2959). Curran Associates, Inc.
SSZA14
Snoek, J., Swersky, K., Zemel, R., & Adams, R. (2014) Input Warping for Bayesian Optimization of Non-Stationary Functions. In Proceedings of the 31st International Conference on Machine Learning (ICML-14) (pp. 1674–1682).
SKKS12
Srinivas, N., Krause, A., Kakade, S. M., & Seeger, M. (2012) Gaussian Process Optimization in the Bandit Setting: No Regret and Experimental Design. IEEE Transactions on Information Theory, 58(5), 3250–3265. DOI.
SwSA13
Swersky, K., Snoek, J., & Adams, R. P. (2013) Multi-Task Bayesian Optimization. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, & K. Q. Weinberger (Eds.), Advances in Neural Information Processing Systems 26 (pp. 2004–2012). Curran Associates, Inc.
THHL13
Thornton, C., Hutter, F., Hoos, H. H., & Leyton-Brown, K. (2013) Auto-WEKA: Combined Selection and Hyperparameter Optimization of Classification Algorithms. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 847–855). New York, NY, USA: ACM DOI.