Bandit problems and Markov decision processes, with a smattering of dynamic programming, game theory, optimal control, and online learning; plus the solutions to such problems, especially reinforcement learning.
Learning where you must discover an optimal action in response to your stimulus, possibly an optimal “policy” of trying different actions over time, not just an MMSE-minimal prediction from complete data.
Comes in adversarial and stochastic flavours, apparently, although I've hit the boundaries of my knowledge there.
Pseudopolitical diversion
See clickbait bandits.
Details
TBD.
Conceptually, the base model here is a one- or many-armed poker machine. You pop coins in, and each time you do you may pull an arm; you might get rewarded. Each arm of the machine might have different returns, but the only way to find out is to play.
How do you choose which arms to pull, and when, optimally? How much is it worth spending to find the arm with the best return on investment, given that it costs to collect more data?
This can be formalised by defining and then minimising regret, and what you get out is a formalised version of Skinnerian learning that you can easily implement as an algorithm, with some satisfyingly optimal properties.
The question of whether to stick with an arm you know to be good, or to switch to another in the hope that it is better, is emblematic of the whole field; it's called the exploration/exploitation trade-off.
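To make that trade-off concrete, here is a minimal sketch of the classic ε-greedy strategy on a simulated Bernoulli bandit. The arm means, ε=0.1, and round count are made up for illustration; nothing here comes from a particular reference above.

```python
import random

def pull(arm, true_means):
    """Simulate one pull: Bernoulli reward with the arm's (hidden) true mean."""
    return 1.0 if random.random() < true_means[arm] else 0.0

def epsilon_greedy(true_means, n_rounds=10_000, epsilon=0.1, seed=0):
    """Epsilon-greedy: explore a uniformly random arm with probability epsilon,
    otherwise exploit the arm with the best empirical mean so far."""
    random.seed(seed)
    n_arms = len(true_means)
    counts = [0] * n_arms
    values = [0.0] * n_arms  # running empirical mean reward per arm
    total_reward = 0.0
    for _ in range(n_rounds):
        if random.random() < epsilon:
            arm = random.randrange(n_arms)                      # explore
        else:
            arm = max(range(n_arms), key=lambda a: values[a])   # exploit
        r = pull(arm, true_means)
        counts[arm] += 1
        values[arm] += (r - values[arm]) / counts[arm]  # incremental mean update
        total_reward += r
    return total_reward, counts

reward, counts = epsilon_greedy([0.2, 0.5, 0.8])
```

After enough rounds, the best arm (mean 0.8) dominates the pull counts, while the fixed ε keeps paying a small ongoing exploration cost; that residual cost is exactly what the fancier regret-minimising algorithms (UCB, Thompson sampling) improve on.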
Delayed reward
Not really an ex
Richard Xu's take:

No data labels as in supervised learning, only a reward signal

Feedback is delayed

Data are not i.i.d. but sequentially dependent

The agent's actions affect the data it receives.
Multiworld testing
New tool by Microsoft: MultiWorld Testing (MWT) appears to be an online learning system that makes its data go further by reusing it for offline evaluation:
… is a toolbox of machine learning technology for principled and efficient experimentation, plausibly applicable to most Microsoft services that interact with customers. In many scenarios, this technology is exponentially more efficient than the traditional A/B testing. The underlying research area, mature and yet very active, is known under many names: “multi-armed bandits”, “contextual bandits”, “associative reinforcement learning”, and “counterfactual evaluation”, among others.
To take an example, suppose one wants to optimize clicks on suggested news stories. To discover what works, one needs to explore over the possible news stories. Further, if the suggested news story can be chosen depending on the visitor's profile, then one needs to explore over the possible “policies” that map profiles to news stories (and there are exponentially more “policies” than news stories!). Traditional ML fails at this because it does not explore. Whereas MWT allows you to explore continuously, and optimize your decisions using this exploration data.
John Langford gives a better introduction here (cf. ABCD15):
Removing the credit assignment problem from reinforcement learning yields the Contextual Bandit setting which we know is generically solvable in the same manner as common supervised learning problems. I know of about a half-dozen real-world successful contextual bandit applications typically requiring the cooperation of engineers and deeply knowledgeable data scientists.
Can we make this dramatically easier? We need a system that explores over appropriate choices with logging of features, actions, probabilities of actions, and outcomes. These must then be fed into an appropriate learning algorithm which trains a policy and then deploys the policy at the point of decision. Naturally, this is what we’ve done and now it can be used by anyone. This drops the barrier to use down to: “Do you have permissions? And do you have a reasonable idea of what a good feature is?”
A key foundational idea is Multiworld Testing: the capability to evaluate large numbers of policies mapping features to action in a manner exponentially more efficient than standard A/B testing. This is used pervasively in the Contextual Bandit literature and you can see it in action for the system we’ve made at Microsoft Research.
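The engine behind this kind of counterfactual evaluation is importance weighting: if the logging policy recorded the probability of each action it took, any other policy can be scored offline by inverse propensity scoring. A minimal sketch; the log entries, context names, and toy policy below are hypothetical, not taken from the MWT system:

```python
def ips_value(logged, policy):
    """Inverse-propensity-score estimate of a policy's average reward from
    logged (context, action, prob_of_action, reward) tuples."""
    total = 0.0
    for context, action, prob, reward in logged:
        if policy(context) == action:
            total += reward / prob  # reweight the rewards of matching actions
    return total / len(logged)

# Hypothetical exploration log: contexts are user segments, actions are
# story ids, prob is the logging policy's chance of the chosen action.
log = [
    ("sports_fan", "story_a", 0.5, 1.0),
    ("sports_fan", "story_b", 0.5, 0.0),
    ("news_junkie", "story_b", 0.5, 1.0),
    ("news_junkie", "story_a", 0.5, 0.0),
]
policy = lambda c: "story_a" if c == "sports_fan" else "story_b"
print(ips_value(log, policy))  # → 1.0
```

The same exploration log scores every candidate policy, which is the sense in which one exploration dataset substitutes for exponentially many A/B tests.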
Extensions
Deep reinforcement learning
Of course, artificial neural networks are a thing in this domain too.
See Andrej Karpathy's explanation.
Casual concrete example and intro by Mat Kelcey.
The trick is that you approximate the action-value table of Q-learning using a neural net.
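For contrast with the neural-net version, here is the underlying tabular Q-learning update, sketched on a made-up chain world. The chain MDP, learning rate, and episode count are illustrative assumptions, not from any source above; deep Q-learning replaces the `Q` table with a network trained on the same Bellman targets.

```python
import random

def q_learning(n_states=5, n_episodes=500, alpha=0.1, gamma=0.9, eps=0.1, seed=0):
    """Tabular Q-learning on a simple chain: states 0..n_states-1,
    actions 0 (left) and 1 (right); reward 1 for reaching the last state."""
    random.seed(seed)
    Q = [[0.0, 0.0] for _ in range(n_states)]  # the action-value table
    for _ in range(n_episodes):
        s = 0
        while s != n_states - 1:
            # epsilon-greedy action selection over the current Q estimates
            if random.random() < eps:
                a = random.randrange(2)
            else:
                a = 0 if Q[s][0] > Q[s][1] else 1
            s2 = max(0, s - 1) if a == 0 else s + 1
            r = 1.0 if s2 == n_states - 1 else 0.0
            # Bellman backup toward r + gamma * max_a' Q(s', a')
            Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
            s = s2
    return Q

Q = q_learning()
```

After training, the greedy policy read off the table points right in every state, and values decay by roughly the discount factor per step away from the goal.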
Implementation: keras-rl
Intros
Here's an intro to all of machine learning via the historical tale of one particular attempt to teach a machine (not a computer!) to play tic-tac-toe: Rodney Brooks, Machine Learning Explained.
Practice

Arthur Juliani's Simple Reinforcement Learning in Tensorflow
Theory

Sébastien Bubeck's intro (complements and updates BuCe12)

Sergey Feldman, Bandits for Recommendation Systems is an easy introduction.

Microsoft's recommended intro is BuCe12.

The internet loves David Silver's course.

Richard S. Sutton and Andrew G. Barto's Reinforcement Learning: An Introduction

Elad Hazan's origin story for online convex optimisation has just made me realise how new it is. It complements…

Elad Hazan and Satyen Kale's tutorial on online convex optimisation.

The ageing but gentle intro resource, AI Depot's Reinforcement learning page
Bandits meet optimisation
Bubeck again: Kernel-based methods for bandit convex optimization, part 1.
Bandits meet evolution
Ben Recht and Roy Frostig, Nesterov's punctuated equilibrium:
In a landmark new paper by Salimans, Ho, Chen, and Sutskever from OpenAI, (SHCS17) the authors show that a particular class of genetic algorithms (called Evolutionary Strategies) gives excellent performance on a variety of reinforcement learning benchmarks. As optimizers, the application of genetic algorithms raises red flags and usually causes us to close browser windows. But fear not! As we will explain, the particular algorithm deployed happens to be a core method in optimization, and the fact that this method is successful sheds light on the peculiarities of reinforcement learning more than it does about genetic algorithms in general.
Markov decision problems
Bellman and Howard's classic discrete-time stochastic control problem
 Warren Powell's Introduction to Markov decision processes
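The classic solution method here is dynamic programming by value iteration: apply the Bellman backup repeatedly until the value function stops changing, then read off the greedy policy. A minimal sketch on a hypothetical two-state MDP; the transition and reward numbers are invented for illustration:

```python
def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Value iteration for a finite MDP.
    P[s][a] is a list of (probability, next_state) pairs;
    R[s][a] is the expected immediate reward for action a in state s."""
    n_states = len(P)
    V = [0.0] * n_states
    while True:
        # Bellman backup: V(s) <- max_a [ R(s,a) + gamma * E V(s') ]
        V_new = [
            max(R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])
                for a in range(len(P[s])))
            for s in range(n_states)
        ]
        delta = max(abs(x - y) for x, y in zip(V, V_new))
        V = V_new
        if delta < tol:
            break
    # Greedy policy with respect to the converged values
    policy = [
        max(range(len(P[s])),
            key=lambda a: R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a]))
        for s in range(n_states)
    ]
    return V, policy

# Hypothetical two-state MDP: in state 0, action 1 gambles on a move to
# the rewarding state 1; in state 1, action 0 pays off, action 1 leaves.
P = [
    [[(1.0, 0)], [(0.5, 0), (0.5, 1)]],   # state 0: stay / gamble
    [[(1.0, 1)], [(1.0, 0)]],             # state 1: stay / leave
]
R = [
    [0.0, 0.0],   # state 0: no immediate reward either way
    [1.0, 0.0],   # state 1: staying pays 1 per step
]
V, policy = value_iteration(P, R)  # policy → [1, 0]: gamble, then stay
```

With discount 0.9 the optimal values solve the Bellman fixed point exactly: V(1) = 1/(1−0.9) = 10, and V(0) = 0.9·(0.5·V(0) + 0.5·10) ≈ 8.18.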
POMDP
Too many CPU cycles? Why not look at optimal control in this setting?
“A POMDP is a partially observable Markov decision process. It is a model, originating in the operations research (OR) literature, for describing planning tasks in which the decision maker does not have complete information as to its current state. The POMDP model provides a convenient way of reasoning about tradeoffs between actions to gain reward and actions to gain information.”
Practicalities
Vowpal Wabbit does contextual bandit learning:
VW contains a contextual bandit module which allows you to optimize a predictor based on already collected contextual bandit data. In other words, the module does not handle the exploration issue, it assumes it can only use the currently available data previously collected from some “exploration” policy.
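For orientation, VW's contextual-bandit input format (as I understand it from the VW documentation; check the wiki for your version) records, per example, the chosen action, its observed cost, and the logging policy's probability of choosing that action, followed by the features:

```
1:2:0.4 | user_age=25 region=us
2:0:0.6 | user_age=43 region=eu
```

Each line is `action:cost:probability | features`; training against, say, 4 possible actions would then look something like `vw -d train.dat --cb 4`.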
Multiworld Testing source code.
Sequential surrogate interactive model optimisation
Not what I think of here, but it and many other hyperparameter-optimisation-type problems have an RL interpretation. That is, you can use RL to learn the hyperparameters of your deep learning model. (This is not the same as deep learning of RL policies.) See Sequential surrogate model optimisation for more of that, or the papers JaTa15, LJDR16a, LJDR16b, FBBZ17.
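The bandit view of hyperparameter search treats each configuration as an arm and the training budget as the coin supply. Here is a sketch of successive halving, the core subroutine of Hyperband (LJDR16b); the noisy `evaluate` function standing in for a training run is a made-up assumption for illustration:

```python
import random

def successive_halving(configs, evaluate, budget=1):
    """Score every configuration on a small budget, keep the better half,
    double the budget, and repeat until one survivor remains.
    evaluate(config, budget) returns a loss; lower is better."""
    while len(configs) > 1:
        scored = sorted(configs, key=lambda c: evaluate(c, budget))
        configs = scored[: max(1, len(configs) // 2)]
        budget *= 2
    return configs[0]

# Hypothetical "training run": the observed loss is the config's true
# quality plus noise that shrinks as the budget grows.
random.seed(0)
def evaluate(config, budget):
    return config + random.gauss(0, 0.05 / budget)

best = successive_halving([0.1 * i for i in range(16)], evaluate)
```

The point is the budget allocation: bad configurations are eliminated cheaply on small budgets, so most of the compute goes to the promising arms, just as a good bandit algorithm stops feeding coins to losing arms.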
To read
 Bell57: Richard Bellman (1957) A Markovian decision process. DTIC Document
 BeKa61: Richard Bellman, Robert Kalaba (1961) A note on interrupted stochastic control processes. Information and Control, 4(4), 346–349. DOI
 KrAL16: Akshay Krishnamurthy, Alekh Agarwal, John Langford (2016) Contextual-MDPs for PAC-Reinforcement Learning with Rich Observations. ArXiv:1602.02722 [Cs, Stat].
 LCKG15: Lihong Li, Shunbao Chen, Jim Kleban, Ankur Gupta (2015) Counterfactual Estimation and Optimization of Click Metrics in Search Engines: A Case Study. In Proceedings of the 24th International World Wide Web Conference (WWW’14), Companion Volume. ACM – Association for Computing Machinery
 BPQC13: Léon Bottou, Jonas Peters, Joaquin Quiñonero-Candela, Denis X. Charles, D. Max Chickering, Elon Portugaly, … Ed Snelson (2013) Counterfactual Reasoning and Learning Systems: The Example of Computational Advertising. Journal of Machine Learning Research, 14, 3207–3260.
 Howa60: Ronald A. Howard (1960) Dynamic Programming and Markov Processes.
 Thru92: Sebastian B. Thrun (1992) Efficient Exploration In Reinforcement Learning
 LJDR16a: Lisha Li, Kevin Jamieson, Giulia DeSalvo, Afshin Rostamizadeh, Ameet Talwalkar (2016a) Efficient Hyperparameter Optimization and Infinitely Many Armed Bandits. ArXiv:1603.06560 [Cs, Stat].
 SHCS17: Tim Salimans, Jonathan Ho, Xi Chen, Ilya Sutskever (2017) Evolution Strategies as a Scalable Alternative to Reinforcement Learning. ArXiv:1703.03864 [Cs, Stat].
 Si04: Jennie Si (2004) Handbook of learning and approximate dynamic programming (Vol. 2). John Wiley & Sons
 LJDR16b: Lisha Li, Kevin Jamieson, Giulia DeSalvo, Afshin Rostamizadeh, Ameet Talwalkar (2016b) Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization. ArXiv:1603.06560 [Cs, Stat].
 ABCD15: Alekh Agarwal, Sarah Bird, Markus Cozowicz, Miro Dudik, John Langford, Lihong Li, … Alex Slivkins (2015) Introduction to MultiWorld Testing
 SLLK11: Alexander L. Strehl, John Langford, Lihong Li, Sham M. Kakade (2011) Learning from Logged Implicit Exploration Data. In Advances in Neural Information Processing Systems 23 (NIPS10) (pp. 2217–2225).
 Lang13: John Langford (2013) Learning to interact
 Bott14: Leon Bottou (2014) Learning to Interact
 PaSa17: Emilio Parisotto, Ruslan Salakhutdinov (2017) Neural Map: Structured Memory for Deep Reinforcement Learning. ArXiv:1702.08360 [Cs].
 JaTa15: Kevin Jamieson, Ameet Talwalkar (2015) Nonstochastic Best Arm Identification and Hyperparameter Optimization. ArXiv:1502.07943 [Cs, Stat].
 SMSM00: Richard S. Sutton, David A. McAllester, Satinder P. Singh, Yishay Mansour (2000) Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems (pp. 1057–1063).
 CeLu06: Nicolò Cesa-Bianchi, Gábor Lugosi (2006) Prediction, Learning, and Games. Cambridge; New York: Cambridge University Press
 ShYC06: Takeshi Shibata, Ryo Yoshinaka, Takashi Chikayama (2006) Probabilistic Generalization of Simple Grammars and Its Application to Reinforcement Learning. In Algorithmic Learning Theory (pp. 348–362). Springer Berlin Heidelberg
 BuMS11: Sébastien Bubeck, Rémi Munos, Gilles Stoltz (2011) Pure Exploration in Finitely-Armed and Continuous-Armed Bandits. Theoretical Computer Science, 412, 1832–1852.
 BuCe12: Sébastien Bubeck, Nicolò Cesa-Bianchi (2012) Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems. Foundations and Trends® in Machine Learning, 5(1), 1–122. DOI
 SuBa98: Richard S Sutton, Andrew G Barto (1998) Reinforcement learning. Cambridge, Mass.: MIT Press
 DaWa00: Peter Dayan, Christopher JCH Watkins (n.d.) Reinforcement Learning. In Encyclopedia of Cognitive Science.
 KaLM96: L. P. Kaelbling, M. L. Littman, A. W. Moore (1996) Reinforcement Learning: A Survey. Journal of Artificial Intelligence Research, 4.
 JaSJ95: Tommi Jaakkola, Satinder P. Singh, Michael I. Jordan (1995) Reinforcement learning algorithm for partially observable Markov decision problems. In Advances in neural information processing systems (pp. 345–352).
 Krak16: Marina Krakovsky (2016) Reinforcement Renaissance. Commun. ACM, 59(8), 12–14. DOI
 HSWR13: Katja Hofmann, Anne Schuth, Shimon Whiteson, Maarten de Rijke (2013) Reusing Historical Interaction Data for Faster Online Learning to Rank for IR.
 SoXP15: Ruiyang Song, Yao Xie, Sebastian Pokutta (2015) Sequential Information Guided Sensing. ArXiv:1509.00130 [Cs, Math, Stat].
 BuSl12: Sebastien Bubeck, Aleksandrs Slivkins (2012) The best of both worlds: stochastic and adversarial bandits. ArXiv:1202.4473 [Cs].
 LCLW11: Lihong Li, Wei Chu, John Langford, Xuanhui Wang (2011) Unbiased Offline Evaluation of Contextual-bandit-based News Article Recommendation Algorithms. In Proceedings of the Fourth International Conference on Web Search and Web Data Mining (WSDM11) (pp. 297–306).
 Powe09: Warren B. Powell (2009) What you should know about approximate dynamic programming. Naval Research Logistics (NRL) , 56(3), 239–249.