Bandit problems, Markov decision processes, a smattering of dynamic programming, game theory, optimal control, and online learning of the solutions to such problems, esp.Â reinforcement learning.

Learning, where you must learn an optimal action in response to your stimulus, possibly an optimal â€śpolicyâ€ť of trying different actions over time, not just an MMSE-minimal prediction from complete data.

Comes in adversarial, mMarkov, and stochastic flavours, apparently, although Iâ€™ve hit the boundaries of my knowledge there.

## Pseudopolitical diversion

See clickbait bandits.

## Details

đźš§

Conceptually, the base model here is a one- or many-armed poker machine. You can pop coins in, and each time you do you may pull an arm; you might get rewarded. Each arm of the machine might have different returns; but the only way to find out is to play.

How do you choose optimally which arms to pull and when? How much is to work spending to find the arm with the best return on investment, given that it costs to collect more data?

This can be formalised by minimizing *regret* and defining some other terms and what you get out is a formalized version of Skinnerian learning that you can easily implement as an algorithm and feel that you have got some satisfyingly optimal properties for.

The thing about whether to choose a new arm when you are on a known good one, or to switch to another one in hope of it being better, this is a symbolic one and itâ€™s called the *exploration/exploitation trade-off*.

### Delayed/sparse reward

Not really an ex

Richard Xuâ€™s take:

- No data label as in supervised learning, only a
*reward*signal - feedback is delayed
- data are not i.i.d. but sequentially dependent
- agentâ€™s actions affect the data it receives.

## Multi-world testing

A setting by Microsoft: Multi-World Testing (MWT) appears to be an online learning problem that augments its data by re-using the data for offline testing:

â€¦ is a toolbox of machine learning technology for principled and efficient experimentation, plausibly applicable to most Microsoft services that interact with customers. In many scenarios, this technology is exponentially more efficient than the traditional A/B testing. The underlying research area, mature and yet very active, is known under many names: â€śmulti-armed banditsâ€ť, â€ścontextual banditsâ€ť, â€śassociative reinforcement learningâ€ť, and â€ścounterfactual evaluationâ€ť, among others.

To take an example, suppose one wants to optimize clicks on suggested news stories. To discover what works, one needs to explore over the possible news stories. Further, if the suggested news story can be chosen depending on the visitorâ€™s profile, then one needs to explore over the possible â€śpoliciesâ€ť that map profiles to news stories (and there are exponentially more â€śpoliciesâ€ť than news stories!). Traditional ML fails at this because it does not explore. Whereas MWT allows you to explore continuously, and optimize your decisions using this exploration data.

It has a better introduction here, by John Langford: (cf (Agarwal et al. 2015).)

Removing the credit assignment problem from reinforcement learning yields the Contextual Bandit setting which we know is generically solvable in the same manner as common supervised learning problems. I know of about a half-dozen real-world successful contextual bandit applications typically requiring the cooperation of engineers and deeply knowledgeable data scientists.

Can we make this dramatically easier? We need a system that explores over appropriate choices with logging of features, actions, probabilities of actions, and outcomes. These must then be fed into an appropriate learning algorithm which trains a policy and then deploys the policy at the point of decision. Naturally, this is what weâ€™ve done and now it can be used by anyone. This drops the barrier to use down to: â€śDo you have permissions? And do you have a reasonable idea of what a good feature is?â€ť

A key foundational idea is Multiworld Testing: the capability to evaluate large numbers of policies mapping features to action in a manner exponentially more efficient than standard A/B testing. This is used pervasively in the Contextual Bandit literature and you can see it in action for the system weâ€™ve made at Microsoft Research.

## Extensions

### Deep reinforcement learning

Of course, artificial neural networks are a thing in this domain too.

See Andrej Karpathyâ€™s explanation.

Casual concrete example and intro by Mat Kelcey.

The trick is you approximate the action table in Q-learning using a neural net.

Implementation: keras-rl

## Intros

Hereâ€™s an intro to all of machine learning through a historical tale about how one particular attempt to teach a machine (not a computer!) to play tic-tac-toe: Rodney Brooks, Machine Learning Explained.

### Practice

- Denny Britz exercises
- Arthur Julianiâ€™s Simple Reinforcement Learning in Tensorflow
- Emami, Deep deterministic Policy Gradients in Tensorflow

### Theory

SĂ©bastien Bubeckâ€™s intro (complements and updates (Bubeck and Cesa-Bianchi 2012))

Sergey Feldman, Bandits for Recommendation Systems is an EZ-introduction.

The internet loveâ€™s David Silverâ€™s course.

Richard S. Sutton and Andrew G. Bartoâ€™s Reinforcement Learning: An Introduction

Langfordâ€™s 2013 NIPS presentation ((Langford 2013))

Elad Hazanâ€™s origin story for online convex optimisation has just made me realise how new it is. It complementsâ€¦

Elad Hazan and Satyan Kaleâ€™s tutorial on online convex optimisation.

The aging, but gentle intro resource, AI-depotâ€™s Reinforcement learning page

### Bandits-meet-optimisation

Bubeck again: Kernel-based methods for bandit convex optimization, part 1.

### Bandits-meet-evolution

Ben Recht and Roy Frostig, Nesterovâ€™s punctuation equilibrium:

In a landmark new paper by Salimans, Ho, Chen, and Sutskever from OpenAI, (Salimans et al. 2017) the authors show that a particular class of genetic algorithms (called Evolutionary Strategies) gives excellent performance on a variety of reinforcement learning benchmarks. As optimizers, the application of genetic algorithms raises red flags and usually causes us to close browser Windows. But fear not! As we will explain, the particular algorithm deployed happens to be a core method in optimization, and the fact that this method is successful sheds light on the peculiarities of reinforcement learning more than it does about genetic algorithms in general.

## Markov decision problems

Bellman and Howardâ€™s classic discrete time control stochastic problem

- Warren Powellâ€™s Introduction to markov decision processes

## POMDP

Too many CPU cycles? Why not look at optimal control in this setting?

â€śA POMDP is a partially observable Markov decision process. It is a model, originating in the operations research (OR) literature, for describing planning tasks in which the decision maker does not have complete information as to its current state. The POMDP model provides a convenient way of reasoning about tradeoffs between actions to gain reward and actions to gain information.â€ť

## Practicalities

Vowpal Wabbit does contextual bandit learning.:

VW contains a contextual bandit module which allows you to optimize a predictor based on already collected contextual bandit data. In other words, the module does not handle the exploration issue, it assumes it can only use the currently available data previously collected from some â€śexplorationâ€ť policy.

Multiple World Testing source code.

## Sequential surrogate interactive model optimisation

Not what I think of here, but it and many other hyperparameter optimization-type problems have an RL interpretation. That is, you can use RL to learn the hyper parameters of your deep learning model. (This is not the same as deep learning *of* RL policies.) See Sequential surrogate model optimisation for more of that, or the papers (Jamieson and Talwalkar 2015; L. Li et al. 2016a, 2016b).

# Refs

Agarwal, Alekh, Sarah Bird, Markus Cozowicz, Miro Dudik, John Langford, Lihong Li, Luong Hoang, Sid Sen, Robert Schapire, and Alex Slivkins. 2015. â€śIntroduction to Multi-World Testing.â€ť http://research.microsoft.com/en-us/projects/mwt/.

Agarwal, Aman, Soumya Basu, Tobias Schnabel, and Thorsten Joachims. 2017. â€śEffective Evaluation Using Logged Bandit Feedback from Multiple Loggers.â€ť In *Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD â€™17*, 687â€“96. Halifax, NS, Canada: ACM Press. https://doi.org/10.1145/3097983.3098155.

Allen-Zhu, Zeyuan, Yuanzhi Li, Aarti Singh, and Yining Wang. 2017. â€śNear-Optimal Design of Experiments via Regret Minimization.â€ť In *PMLR*, 126â€“35. http://proceedings.mlr.press/v70/allen-zhu17e.html.

Bellman, Richard. 1957. â€śA Markovian Decision Process.â€ť DTIC Document. http://oai.dtic.mil/oai/oai?verb=getRecord&metadataPrefix=html&identifier=AD0606367.

Bellman, Richard, and Robert Kalaba. 1961. â€śA Note on Interrupted Stochastic Control Processes.â€ť *Information and Control* 4 (4): 346â€“49. https://doi.org/10.1016/S0019-9958(61)80050-8.

Bottou, Leon. 2014. â€śLearning to Interact.â€ť presented at the Microsoft Techfest.

Bottou, LĂ©on, Jonas Peters, Joaquin QuiĂ±onero-Candela, Denis X. Charles, D. Max Chickering, Elon Portugaly, Dipankar Ray, Patrice Simard, and Ed Snelson. 2013. â€śCounterfactual Reasoning and Learning Systems: The Example of Computational Advertising.â€ť *Journal of Machine Learning Research* 14: 3207â€“60. http://jmlr.org/papers/v14/bottou13a.html.

Bubeck, SĂ©bastien, and NicolĂ˛ Cesa-Bianchi. 2012. â€śRegret Analysis of Stochastic and Nonstochastic Multi-Armed Bandit Problems.â€ť *Foundations and TrendsÂ® in Machine Learning* 5 (1): 1â€“122. https://doi.org/10.1561/2200000024.

Bubeck, SĂ©bastien, RĂ©mi Munos, and Gilles Stoltz. 2011. â€śPure Exploration in Finitelyâ€“Armed and Continuousâ€“Armed Bandits.â€ť *Theoretical Computer Science* 412: 1832â€“52. http://core.ac.uk:8081/download/pdf/1147325.pdf.

Bubeck, SĂ©bastien, and Aleksandrs Slivkins. 2012. â€śThe Best of Both Worlds: Stochastic and Adversarial Bandits,â€ť February. http://arxiv.org/abs/1202.4473.

Cesa-Bianchi, Nicolo, and GĂˇbor Lugosi. 2006. *Prediction, Learning, and Games*. Cambridge ; New York: Cambridge University Press.

Dayan, Peter, and Christopher JCH Watkins. n.d. â€śReinforcement Learning.â€ť In *Encyclopedia of Cognitve Science*.

Du, Simon, Akshay Krishnamurthy, Nan Jiang, Alekh Agarwal, Miro DudĂk, and John Langford. 2019. â€śProvably Efficient RL with Rich Observations via Latent State Decoding,â€ť January. https://www.microsoft.com/en-us/research/publication/provably-efficient-rl-with-rich-observations-via-latent-state-decoding/.

Hofmann, Katja, Anne Schuth, Shimon Whiteson, and Maarten de Rijke. 2013. â€śReusing Historical Interaction Data for Faster Online Learning to Rank for IR.â€ť https://staff.science.uva.nl/s.a.whiteson/pubs/hofmannwsdm13.pdf.

Howard, Ronald A. 1960. â€śDynamic Programming and Markov Processes..â€ť http://trid.trb.org/view.aspx?id=521429.

Jaakkola, Tommi, Satinder P. Singh, and Michael I. Jordan. 1995. â€śReinforcement Learning Algorithm for Partially Observable Markov Decision Problems.â€ť In *Advances in Neural Information Processing Systems*, 345â€“52. http://papers.nips.cc/paper/951-reinforcement-learning-algorithm-for-partially-observable-markov-decision-problems.pdf.

Jamieson, Kevin, and Ameet Talwalkar. 2015. â€śNon-Stochastic Best Arm Identification and Hyperparameter Optimization,â€ť February. http://arxiv.org/abs/1502.07943.

Kaelbling, L. P., M. L. Littman, and A. W. Moore. 1996. â€śReinforcement Learning: A Survey.â€ť *Journal of Artifical Intelligence Research* 4 (April). http://arxiv.org/abs/cs/9605103.

Krakovsky, Marina. 2016. â€śReinforcement Renaissance.â€ť *Commun. ACM* 59 (8): 12â€“14. https://doi.org/10.1145/2949662.

Krishnamurthy, Akshay, Alekh Agarwal, and John Langford. 2016. â€śContextual-MDPs for PAC-Reinforcement Learning with Rich Observations,â€ť February. http://arxiv.org/abs/1602.02722.

Langford, John. 2013. â€śLearning to Interact.â€ť presented at the NIPS 2013. http://hunch.net/~jl/interact.pdf.

Li, Lihong, Shunbao Chen, Jim Kleban, and Ankur Gupta. 2015. â€śCounterfactual Estimation and Optimization of Click Metrics in Search Engines: A Case Study.â€ť In *Proceedings of the 24th International World Wide Web Conference (WWWâ€™14), Companion Volume*. ACM â€“ Association for Computing Machinery. http://research.microsoft.com/apps/pubs/default.aspx?id=240958.

Li, Lihong, Wei Chu, John Langford, and Xuanhui Wang. 2011. â€śUnbiased Offline Evaluation of Contextual-Bandit-Based News Article Recommendation Algorithms.â€ť In *Proceedings of the Fourth International Conference on Web Search and Web Data Mining (WSDM-11)*, 297â€“306. http://research.microsoft.com/apps/pubs/default.aspx?id=178905.

Li, Lisha, Kevin Jamieson, Giulia DeSalvo, Afshin Rostamizadeh, and Ameet Talwalkar. 2016a. â€śEfficient Hyperparameter Optimization and Infinitely Many Armed Bandits,â€ť March. http://arxiv.org/abs/1603.06560.

â€”â€”â€”. 2016b. â€śHyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization,â€ť March. http://arxiv.org/abs/1603.06560.

Parisotto, Emilio, and Ruslan Salakhutdinov. 2017. â€śNeural Map: Structured Memory for Deep Reinforcement Learning,â€ť February. http://arxiv.org/abs/1702.08360.

Pfau, David, and Oriol Vinyals. 2016. â€śConnecting Generative Adversarial Networks and Actor-Critic Methods,â€ť October. http://arxiv.org/abs/1610.01945.

Powell, Warren B. 2009. â€śWhat You Should Know About Approximate Dynamic Programming.â€ť *Naval Research Logistics (NRL)* 56 (3): 239â€“49. http://onlinelibrary.wiley.com/doi/10.1002/nav.20347/abstract.

Salimans, Tim, Jonathan Ho, Xi Chen, and Ilya Sutskever. 2017. â€śEvolution Strategies as a Scalable Alternative to Reinforcement Learning,â€ť March. http://arxiv.org/abs/1703.03864.

Shibata, Takeshi, Ryo Yoshinaka, and Takashi Chikayama. 2006. â€śProbabilistic Generalization of Simple Grammars and Its Application to Reinforcement Learning.â€ť In *Algorithmic Learning Theory*, edited by JosĂ© L. BalcĂˇzar, Philip M. Long, and Frank Stephan, 348â€“62. Lecture Notes in Computer Science 4264. Springer Berlin Heidelberg. http://link.springer.com/chapter/10.1007/11894841_28.

Si, Jennie. 2004. *Handbook of Learning and Approximate Dynamic Programming*. Vol. 2. John Wiley & Sons. http://books.google.ch/books?hl=en&lr=&id=JHvz1elubJQC&oi=fnd&pg=PR18&dq=Powell+approximate+dynamic+programming&ots=Uj-Ez7cs-Z&sig=TQC1Im6z8lz5JY0ZvPNOXh374jg.

Song, Ruiyang, Yao Xie, and Sebastian Pokutta. 2015. â€śSequential Information Guided Sensing,â€ť August. http://arxiv.org/abs/1509.00130.

Strehl, Alexander L., John Langford, Lihong Li, and Sham M. Kakade. 2011. â€śLearning from Logged Implicit Exploration Data.â€ť In *Advances in Neural Information Processing Systems 23 (NIPS-10)*, 2217â€“25. http://research.microsoft.com/apps/pubs/default.aspx?id=178844.

Su, Yi, Lequn Wang, Michele Santacatterina, and Thorsten Joachims. 2018. â€śCAB: Continuous Adaptive Blending Estimator for Policy Evaluation and Learning,â€ť November. http://arxiv.org/abs/1811.02672.

Sutton, Richard S, and Andrew G Barto. 1998. *Reinforcement Learning*. Cambridge, Mass.: MIT Press. http://lccn.loc.gov/97026416.

Sutton, Richard S., David A. McAllester, Satinder P. Singh, and Yishay Mansour. 2000. â€śPolicy Gradient Methods for Reinforcement Learning with Function Approximation.â€ť In *Advances in Neural Information Processing Systems*, 1057â€“63. http://papers.nips.cc/paper/1713-policy-gradient-methods-for-reinforcement-learning-with-function-approximation.pdf.

Swaminathan, Adith, and Thorsten Joachims. 2015. â€śCounterfactual Risk Minimization.â€ť In *Proceedings of the 24th International Conference on World Wide Web - WWW â€™15 Companion*, 939â€“41. Florence, Italy: ACM Press. https://doi.org/10.1145/2740908.2742564.

Thrun, Sebastian B. 1992. â€śEfficient Exploration in Reinforcement Learning.â€ť http://www.ri.cmu.edu/pub_files/pub1/thrun_sebastian_1992_1/thrun_sebastian_1992_1.pdf.