The Living Thing / Notebooks : Bandit problems, reinforcement learning, and stochastic control

Bandit problems, Markov decision problems, a smattering of dynamic programming and game theory, and online learning of solutions to such problems, especially reinforcement learning.

Learning where you must choose an optimal action in response to a stimulus, possibly an optimal policy of trying different actions over time, rather than just making an MMSE prediction from complete data. Comes in adversarial and stochastic flavours, apparently, although I've hit the boundaries of my knowledge there.
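To make the stochastic flavour concrete, here is a minimal ε-greedy sketch against some Bernoulli arms; all the numbers are invented for illustration.

```python
import random

# Hypothetical Bernoulli arms; the true payout probabilities are unknown to the agent.
TRUE_PAYOUTS = [0.2, 0.5, 0.7]
EPSILON = 0.1  # fraction of pulls spent exploring at random

counts = [0] * len(TRUE_PAYOUTS)    # pulls per arm
values = [0.0] * len(TRUE_PAYOUTS)  # running mean reward per arm

for t in range(10000):
    if random.random() < EPSILON:
        arm = random.randrange(len(TRUE_PAYOUTS))              # explore
    else:
        arm = max(range(len(values)), key=values.__getitem__)  # exploit
    reward = 1.0 if random.random() < TRUE_PAYOUTS[arm] else 0.0
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]  # incremental mean

print(values)  # estimates approach TRUE_PAYOUTS, with most pulls going to arm 2
```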

(Clickbait) bandit problems

Contextual bandits.

A lot of work on bandit problems is done specifically to model consumers and web pages, so I'll talk about it from that perspective:

The science of treating consumers of modern news media like what they are: near-passive objects of surveillance and control. Because trying to rely on people's rationality and agency to get things done has a poor track record in recent history.

Pseudopolitical diversion

The “bandit problems” phrase comes, by the way, from the “one-armed bandit”, the poker machine, extended into a mathematical model for exploring the world by choosing which of several arms to pull.

Pseudopolitical diversion: There is a pleasing symmetry in that modern poker machines, and indeed the internet in general, model the customer as a metaphorical poker machine upon whose arm they pull to get a reward, and that this reward is addicting the customer to pulling on the arms of their literal poker machine. It’s a two-way battle of algorithms, but one side does not update its learning algorithms.

Before you blame someone (especially a millennial, especially if you are not a millennial) for having no attention span, read this next one and take a deep look into your soul: Michael Schulson, “if the internet is addictive, why don’t we regulate it?”

As a consultant to Silicon Valley startups, Eyal helps his clients mimic what he calls the ‘narcotic-like properties’ of sites such as Facebook and Pinterest. His goal, Eyal told Business Insider, is to get users ‘continuing through the same basic cycle. Forever and ever.’

[…] For a tech company in the attention economy, the longer you’re engaged by variable rewards, the more time you spend online, and the more money they make through ad revenue.

Yet we keep blaming people.

Stupid rats, running the mazes we set them, instead of dotcom startups.

Back to the technical details

Also, there’s interesting mathematics! Social graphs! Self-exciting point processes! And all the bandit problem literature!

New tool by Microsoft: Multi-World Testing (MWT) appears to be an online learning system that stretches its data further by re-using logged exploration data for offline testing:

… is a toolbox of machine learning technology for principled and efficient experimentation, plausibly applicable to most Microsoft services that interact with customers. In many scenarios, this technology is exponentially more efficient than the traditional A/B testing. The underlying research area, mature and yet very active, is known under many names: “multi-armed bandits”, “contextual bandits”, “associative reinforcement learning”, and “counterfactual evaluation”, among others.

To take an example, suppose one wants to optimize clicks on suggested news stories. To discover what works, one needs to explore over the possible news stories. Further, if the suggested news story can be chosen depending on the visitor’s profile, then one needs to explore over the possible “policies” that map profiles to news stories (and there are exponentially more “policies” than news stories!). Traditional ML fails at this because it does not explore. Whereas MWT allows you to explore continuously, and optimize your decisions using this exploration data.
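The trick behind that efficiency claim is counterfactual evaluation: if you log the probability with which the deployed policy chose each action, you can score any candidate policy offline with an inverse propensity score (IPS) estimate. A minimal sketch, with an invented exploration log:

```python
# Each logged record: (context, action_taken, probability_of_that_action, reward).
# In MWT these records come from a randomized logging policy.
log = [
    ({"age": 25}, "story_a", 0.5, 1.0),
    ({"age": 60}, "story_b", 0.25, 0.0),
    ({"age": 31}, "story_a", 0.5, 1.0),
]

def ips_value(policy, log):
    """Unbiased estimate of a target policy's average reward from off-policy data."""
    total = 0.0
    for context, action, prob, reward in log:
        if policy(context) == action:
            total += reward / prob  # reweight by inverse propensity
        # records where the target policy disagrees contribute zero
    return total / len(log)

# Score a hypothetical target policy without ever deploying it:
print(ips_value(lambda ctx: "story_a" if ctx["age"] < 40 else "story_b", log))
```

The variance of this estimator blows up when the target policy rarely agrees with the logged one, which is why the literature spends so much effort on refined estimators.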

There is a better introduction by John Langford (cf. ABCD15, ABCH16):

Removing the credit assignment problem from reinforcement learning yields the Contextual Bandit setting which we know is generically solvable in the same manner as common supervised learning problems. I know of about a half-dozen real-world successful contextual bandit applications typically requiring the cooperation of engineers and deeply knowledgeable data scientists.

Can we make this dramatically easier? We need a system that explores over appropriate choices with logging of features, actions, probabilities of actions, and outcomes. These must then be fed into an appropriate learning algorithm which trains a policy and then deploys the policy at the point of decision. Naturally, this is what we’ve done and now it can be used by anyone. This drops the barrier to use down to: “Do you have permissions? And do you have a reasonable idea of what a good feature is?”

A key foundational idea is Multiworld Testing: the capability to evaluate large numbers of policies mapping features to action in a manner exponentially more efficient than standard A/B testing. This is used pervasively in the Contextual Bandit literature and you can see it in action for the system we’ve made at Microsoft Research.
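The logging discipline Langford describes is the part that is easy to botch: at decision time you must record the probability of the action actually taken, or the counterfactual step above becomes impossible. A hypothetical sketch (`observe_reward` stands in for whatever outcome you actually measure):

```python
import random

ACTIONS = ["story_a", "story_b", "story_c"]
EPSILON = 0.1

def observe_reward(action):
    # Stand-in for the real outcome (e.g. did the visitor click?); numbers invented.
    return float(random.random() < {"story_a": 0.3, "story_b": 0.1, "story_c": 0.2}[action])

def choose_and_log(context, current_policy, log):
    """Epsilon-greedy exploration that records everything IPS needs later."""
    greedy = current_policy(context)
    action = random.choice(ACTIONS) if random.random() < EPSILON else greedy
    # The probability the deployed randomization assigned to the action we took:
    prob = EPSILON / len(ACTIONS) + (1 - EPSILON if action == greedy else 0.0)
    reward = observe_reward(action)
    log.append((context, action, prob, reward))
    return action

log = []
choose_and_log({"age": 25}, lambda ctx: "story_a", log)
print(log)
```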


Deep reinforcement learning

Of course, artificial neural networks are a thing in this domain too.

Casual concrete example and intro by Mat Kelcey.

Implementation: keras-rl
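For orientation, here is the tabular Q-learning update that a deep Q-network approximates; a DQN swaps the lookup table for a neural net (plus replay memory and target networks to keep training stable). Hyperparameters here are illustrative only.

```python
import random
from collections import defaultdict

ALPHA, GAMMA, EPSILON = 0.1, 0.99, 0.1
Q = defaultdict(float)  # Q[(state, action)] -> estimated discounted return

def q_update(state, action, reward, next_state, actions):
    """One Q-learning step: move Q(s, a) toward the bootstrapped Bellman target."""
    best_next = max(Q[(next_state, a)] for a in actions)
    td_target = reward + GAMMA * best_next
    Q[(state, action)] += ALPHA * (td_target - Q[(state, action)])

def act(state, actions):
    """Epsilon-greedy behaviour policy over the current value estimates."""
    if random.random() < EPSILON:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])
```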

Markov decision problems

Bellman and Howard’s classic discrete-time stochastic control problem.
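That is: solve for the value function by iterating the Bellman optimality recursion. A minimal value-iteration sketch over an invented 3-state, 2-action MDP:

```python
import numpy as np

# Hypothetical MDP: P[a][s][s'] transition probabilities, R[s][a] expected rewards.
P = np.array([[[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.1, 0.9]],
              [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.5, 0.0, 0.5]]])
R = np.array([[0.0, 1.0], [0.0, 0.0], [1.0, 0.0]])
GAMMA = 0.95

V = np.zeros(3)
for _ in range(1000):
    # Bellman backup: Q(s, a) = R(s, a) + gamma * sum_s' P(s' | s, a) V(s')
    Q = R + GAMMA * np.einsum("ast,t->sa", P, V)
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new

policy = Q.argmax(axis=1)  # greedy policy with respect to the converged values
print(V, policy)
```

Policy iteration, Howard’s contribution, instead alternates policy evaluation with greedy improvement, and typically converges in fewer but more expensive steps.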


Too many CPU cycles? Why not look at optimal control in this setting?

“A POMDP is a partially observable Markov decision process. It is a model, originating in the operations research (OR) literature, for describing planning tasks in which the decision maker does not have complete information as to its current state. The POMDP model provides a convenient way of reasoning about tradeoffs between actions to gain reward and actions to gain information.”
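Concretely, the decision-maker maintains a belief state: a distribution over the hidden states, updated by Bayes’ rule after each action and observation. A minimal sketch with invented matrices:

```python
import numpy as np

# Hypothetical 2-state POMDP with one action, for brevity:
# T[a][s][s'] transition probabilities, O[a][s'][o] observation probabilities.
T = np.array([[[0.8, 0.2], [0.3, 0.7]]])
O = np.array([[[0.9, 0.1], [0.2, 0.8]]])

def belief_update(b, a, o):
    """Bayes filter: b'(s') is proportional to O(o | s', a) * sum_s T(s' | s, a) b(s)."""
    predicted = b @ T[a]             # push the belief through the dynamics
    unnorm = predicted * O[a][:, o]  # weight by the likelihood of the observation
    return unnorm / unnorm.sum()

b = np.array([0.5, 0.5])        # start maximally uncertain
b = belief_update(b, a=0, o=1)  # act, then observe outcome 1
print(b)
```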


Vowpal Wabbit does contextual bandit learning:

VW contains a contextual bandit module which allows you to optimize a predictor based on already collected contextual bandit data. In other words, the module does not handle the exploration issue, it assumes it can only use the currently available data previously collected from some “exploration” policy.
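For concreteness: as I understand VW’s `--cb` input format, each line gives the chosen action, its cost (lower is better), and the logging policy’s probability for that action, followed by the features; exactly the quantities counterfactual evaluation needs. Data here is invented; check the VW wiki for the authoritative format.

```python
# Write a toy training file in VW's contextual bandit format:
# <chosen_action>:<cost>:<probability> | <features>
train_lines = """1:0.0:0.5 | age_25 mobile
2:1.0:0.25 | age_60 desktop
1:0.0:0.5 | age_31 mobile
"""

with open("train.dat", "w") as f:
    f.write(train_lines)

# Then train from the command line (the 2 is the number of possible actions):
#   vw -d train.dat --cb 2 -f cb.model
```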

Multi-World Testing source code.

To read

Agarwal, A., Bird, S., Cozowicz, M., Dudik, M., Langford, J., Li, L., … Slivkins, A. (2015) Introduction to Multi-World Testing.
Agarwal, A., Bird, S., Cozowicz, M., Hoang, L., Langford, J., Lee, S., … Slivkins, A. (2016) A Multiworld Testing Decision Service. arXiv:1606.03966 [cs].
Anders, T., & Miranda, E. R. (2009) A computational model that generalises Schoenberg’s guidelines for favourable chord progressions. In Proceedings of the Sound and Music Computing Conference. Citeseer.
Bellman, R. (1957) A Markovian decision process. DTIC Document.
Bellman, R., & Kalaba, R. (1961) A note on interrupted stochastic control processes. Information and Control, 4(4), 346–349. DOI.
Bottou, L. (2014) Learning to Interact. Presented at the Microsoft Techfest.
Bottou, L., Peters, J., Quiñonero-Candela, J., Charles, D. X., Chickering, D. M., Portugaly, E., … Snelson, E. (2013) Counterfactual Reasoning and Learning Systems: The Example of Computational Advertising. Journal of Machine Learning Research, 14, 3207–3260.
Bubeck, S., & Cesa-Bianchi, N. (2012) Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems. Foundations and Trends® in Machine Learning, 5(1), 1–122. DOI.
Bubeck, S., Munos, R., & Stoltz, G. (2011) Pure Exploration in Finitely-Armed and Continuous-Armed Bandits. Theoretical Computer Science, 412, 1832–1852.
Bubeck, S., & Slivkins, A. (2012) The best of both worlds: stochastic and adversarial bandits. arXiv:1202.4473 [cs].
Cesa-Bianchi, N., & Lugosi, G. (2006) Prediction, learning, and games. Cambridge; New York: Cambridge University Press.
Dayan, P., & Watkins, C. J. (n.d.) Reinforcement Learning. In Encyclopedia of Cognitive Science.
Drummond, C. (2011) Accelerating Reinforcement Learning by Composing Solutions of Automatically Identified Subtasks. Journal of Artificial Intelligence Research. DOI.
Hofmann, K., Schuth, A., Whiteson, S., & de Rijke, M. (2013) Reusing Historical Interaction Data for Faster Online Learning to Rank for IR.
Howard, R. A. (1960) Dynamic Programming and Markov Processes.
Kaelbling, L. P., Littman, M. L., & Moore, A. W. (1996) Reinforcement Learning: A Survey. Journal of Artificial Intelligence Research, 4.
Krishnamurthy, A., Agarwal, A., & Langford, J. (2016) Contextual-MDPs for PAC-Reinforcement Learning with Rich Observations. arXiv:1602.02722 [cs, stat].
Langford, J. (2013) Learning to interact. Presented at NIPS 2013.
Li, L., Chen, S., Kleban, J., & Gupta, A. (2015) Counterfactual Estimation and Optimization of Click Metrics in Search Engines: A Case Study. In Proceedings of the 24th International World Wide Web Conference (WWW’14), Companion Volume. ACM – Association for Computing Machinery.
Li, L., Chu, W., Langford, J., & Wang, X. (2011) Unbiased Offline Evaluation of Contextual-bandit-based News Article Recommendation Algorithms. In Proceedings of the Fourth International Conference on Web Search and Web Data Mining (WSDM-11) (pp. 297–306).
Powell, W. B. (2009) What you should know about approximate dynamic programming. Naval Research Logistics (NRL), 56(3), 239–249.
Shibata, T., Yoshinaka, R., & Chikayama, T. (2006) Probabilistic Generalization of Simple Grammars and Its Application to Reinforcement Learning. In J. L. Balcázar, P. M. Long, & F. Stephan (Eds.), Algorithmic Learning Theory (pp. 348–362). Springer Berlin Heidelberg.
Si, J. (2004) Handbook of learning and approximate dynamic programming. (Vol. 2). John Wiley & Sons.
Song, R., Xie, Y., & Pokutta, S. (2015) Sequential Information Guided Sensing. arXiv:1509.00130 [cs, math, stat].
Strehl, A. L., Langford, J., Li, L., & Kakade, S. M. (2011) Learning from Logged Implicit Exploration Data. In Advances in Neural Information Processing Systems 23 (NIPS-10) (pp. 2217–2225).
Sutton, R. S., & Barto, A. G. (1998) Reinforcement learning. Cambridge, Mass.: MIT Press.
Thrun, S. B. (1992) Efficient Exploration in Reinforcement Learning.