Generalized Thompson sampling for sequential decisionmaking and causal inference
 Pedro A Ortega^{1}Email author and
 Daniel A Braun^{2}Email author
https://doi.org/10.1186/2194320622
© Ortega and Braun; licensee Springer. 2014
Received: 11 November 2013
Accepted: 17 February 2014
Published: 14 March 2014
The Erratum to this article has been published in Complex Adaptive Systems Modeling 2014 2:4
Abstract
Purpose
Sampling an action according to the probability that the action is believed to be the optimal one is sometimes called Thompson sampling.
Methods
Although mostly applied to bandit problems, Thompson sampling can also be used to solve sequential adaptive control problems, when the optimal policy is known for each possible environment. The predictive distribution over actions can then be constructed by a Bayesian superposition of the policies weighted by their posterior probability of being optimal.
Results
Here we discuss two important features of this approach. First, we show in how far such generalized Thompson sampling can be regarded as an optimal strategy under limited information processing capabilities that constrain the sampling complexity of the decisionmaking process. Second, we show how such Thompson sampling can be extended to solve causal inference problems when interacting with an environment in a sequential fashion.
Conclusion
In summary, our results suggest that Thompson sampling might not merely be a useful heuristic, but a principled method to address problems of adaptive sequential decisionmaking and causal inference.
Keywords
Thompson sampling Adaptive control Bounded rationality Decisionmaking Causal inferenceBackground
In a research paper from 1933, Thompson studied the problem of finding out which one of two drugs was better when testing them on a patient population under the constraint that as few people as possible should be subjected to the inferior drug during the course of testing (Thompson [1933]). Given a current (Bayesian) probability estimate P of one treatment being better than the other, he suggested that it might be a good idea to adjust the proportions of future test subjects that take the two drugs to the respective probabilities P and 1P. This way one would not run into the danger of permanently cutting off all future test subjects from a potentially superior treatment that so far seems inferior due to statistical fluctuations, while only temporarily risking exposure to a potentially inferior drug for a smaller proportion of the population. Randomizing actions based on the probability that this action is believed to be optimal when faced with an unknown environment is now sometimes called Thompson sampling.
Today, Thompson’s problem is generally thought of as a bandit problem that consists in determining which lever to pull at which point in time when facing a set of onearmed slot machines, each one having an unknown distribution over a reward variable (Russell and Norvig [1995]; Sutton and Barto [1998]). In the case of known prior probabilities and geometrically discounted future rewards, Gittins ([1979]) provides an optimal policy for the bandit problem that maximizes the expected future cumulative discounted reward. In contrast, Thompson sampling is usually considered as a heuristic approach to solve bandit problems (Wyatt [1997]; Granmo [2008, 2010]; Asmuth et al. [2009]; Graepel et al. [2010]; Scott [2010]; May and Leslie [2011]; Chapelle and Li [2011]; Agrawal and Goyal [2011]; Granmo and Glimsdal [2013]; May et al. [2012]; Kaufmann et al. [2012]; Russo and Roy [2013]; Korda et al. [2013]; Bubeck and Liu [2013]). However, the basic idea of Thompson sampling—that is, sampling actions from a mixture distribution of policies according to their probability of being optimal–can also be applied to solve more general problems in sequential adaptive control (Dearden et al. [1998]; Strens [2000]; Ortega and Braun [2010a, 2010b, 2012a]; Braun and Ortega [2010]; Osband and Russo [2013]; Cao and Ray [2012]; Tziortziotis et al. [2013a, 2013];x Dimitrakakis [2013]; Dimitrakakis and Tziortziotis [2013]; Mellor and Shapiro [2013]).
 1.
Thompson sampling can be considered as an application of Bayes’ rule for acting where actions are treated as causally intervened random variables within the framework of statistical causality.
 2.
Thompson sampling can be considered as a form of optimal adaptive control under bounded rationality where limited information processing capabilities are modeled by entropic search costs.
 3.
Thompson sampling provides a natural strategy for causal induction when interacting with an environment with unknown causal structure.
Although the third section contains an algorithmic extension to previous work (Ortega and Braun [2010a]; [2010b]), it should be emphasized that the main contribution of the paper is not so much to present a novel algorithm, but to discuss basic properties of Thompson sampling, in particular how it relates to the informationtheoretic bounded rationality model in (Ortega and Braun [2013]), how this boundedness can be interpreted in terms of sampling complexity, and how this method can be applied to solve problems of causal inference.
The paper is structured as follows. In Section “Problem statement” we clarify the problem statement and recapitulate the main result of (Ortega and Braun [2010b]). In Section “Decisionmaking with limited resources” we analyze the decisionmaking problem faced by agents that are unable to compute the single best policy. In Section “Causal induction” we investigate how this approach can be applied to adaptive agents that need to discover the causal structure of their environment. Finally, we discuss the significance of these results in Section “Discussion”.
Methods
Problem statement
In an adaptive control problem a decisionmaker faces an environment Q_{ θ } drawn from a set of potential environments $\mathcal{Q}=\left\{{Q}_{\theta}\right\theta \in \Theta \}$. In general θ could be a continuous variable, but we restrict our exposition to the discrete case. Each environment Q_{ θ } can be characterized by a set of conditional distributions Q(o_{ t }θ,a_{≤t},o_{<t}) that indicate the probability of observing o_{ t } given past observations o_{<t}=o_{1}…o_{t1} and past actions a_{≤t}=a_{1}…a_{ t }. This class of environments is very general, and it encompasses multiarmed bandits, (partially observable) Markov decision processes and others –compare Chapter 3 (Legg [2008]). To allow for selfoptimizing agents, the environment is typically assumed to be ergodic, so agents can recover from their mistakes –compare Section 3.5 (Legg [2008]). The decisionmaker has perfectly fitting prediction models P(o_{ t }θ,a_{≤t},o_{<t})=Q(o_{ t }θ,a_{≤t},o_{<t}), but is uncertain about θ. The uncertainty about θ can be represented by a prior distribution P(θ). The interaction proceeds as follows. First an environment θ is sampled from P(θ). The agent picks an action a_{0} and receives an observation o_{0}, to which the agent responds with a_{1} and receives observation o_{1} etc. The agent’s policy can be described by a set of conditional distributions P(a_{ t }o_{<t},a_{<t}).
Problem statement: decisiontheory
In order to solve the problem within the framework of maximum expected utility theory, one requires

a prior P(θ) over possible environments Q_{ θ }

a class of prediction models P(o_{ t }θ,a_{≤t},o_{<t})

a utility function U(o_{≤T},a_{≤T}).
maximizes the expected utility under the mixture distribution (Hutter [2004]). Equations (1) and (2) define a Bayesian adaptive control problem (Martin [1967]; Duff [2002]). This problem formulation becomes quickly intractable, as the number of reachable information states grows exponentially in the time horizon (Duff [2002]).
Problem statement: probability theory & statistical causality
Ignoring the notion of utility for a moment and treating actions purely as (causally intervened) random variables (Pearl [2000]), one could think of another kind of adaptive control problem that is defined entirely in probabilistic and causal terms. This requires the following ingredients

a prior P(θ) over possible environments Q_{ θ }

a class of prediction models P(o_{ t }θ,a_{≤t},o_{<t})

a class of policy models P(a_{ t }θ,a_{<t},o_{<t})
As there can be only one action at any one time, single actions can be obtained as samples from $P\left({a}_{t}\right{\xe2}_{<t},{o}_{<t})$. Importantly, sampling from $P\left({a}_{t}\right{\xe2}_{<t},{o}_{<t})$ is equivalent to first sampling a random belief θ from the posterior $P\left(\theta \right{\xe2}_{<t},{o}_{<t})$ and then sampling an action from $P\left({a}_{t}\right\theta ,{\xe2}_{<t},{o}_{<t})$. This componentwise sampling from a mixture distribution is known as hierarchical sampling and corresponds here to a generalized Thompson sampling procedure, where first a random belief is sampled and then the associated policy with respect to this belief is executed. If we assume now that each of the custombuilt policies is optimal in their respective environments, we effectively select an action according to the probability that it is the optimal action, because we first sample the environment θ according to its posterior probability of being the true environment and then we perform the policy that is optimal in that environment. The question is how this problem formulation can be reconciled with a decisiontheoretic problem statement that involves utilities. This is the topic of Section “Decisionmaking with limited resources”.
Statistical causality
While both actions and observations are treated as random variables in (3), there is an important difference between actions and observations. Observations are produced by the environment and can be used to update the agent’s state of knowledge about the environment. In contrast, actions are set by the agent itself and hence they do not provide information about the environment. This distinction becomes crucial when conditioning on the history of actions and observations. The theory that deals with the distinction between exogenous and endogenous information is statistical causality (Pearl [2000]; Glymour et al. [2000]).
What is a causal intervention?
where w and b are the weather and the Barometer variables respectively, P(w) is the prior probability of the weather (e.g. good or bad) and P(bw) is the likelihood of the barometer change given the weather. The posterior P(wb) allows us to infer the weather from the barometer reading.
Now, imagine you decide to change the level of the mercury yourself, say (using a bit of imagination) by means of a pressurizing device. Now, you set the value of the random variable—and intuition tells us that we cannot predict the weather anymore from the barometer reading. Apparently, our previous Bayesian model is useless now. This shouldn’t come as a surprise, as our intervention effectively changed the relation between the barometer and the weather.
In other words, we don’t gain knowledge about the weather—as expected. Notice that intervening the alternative factorization, $P(\widehat{b},w)=\delta \left(b\right)P\left(w\rightb)$, would give a different result that is inconsistent with our causal story: we have assumed that the mercury level of the barometer depends functionally on the weather, and not the other way around. The reason for this special treatment of actions is that when we set the value of a random variable ourselves, we change Nature’s probability law.
Causal interventions in Thompson sampling
where we first expand the probabilities in terms of the joint distribution, second rewrite the joint distribution as the causal factorization, third remove the intervention tags from the intervened random variables that are in the probability conditions (Pearl’s second rule of docalculus (Pearl [2000])), and fourth replace each conditional probability having an intervened variable in the argument by a delta function over its chosen value—compare Chapter 4.2 in (Pearl [2000]).
obtained only by applying probability theory and causal calculus.
Results
Decisionmaking with limited resources
respectively, where the labels (a) and (b) correspond to the labels in Figure 1. Here it is clearly seen that the difference between the two lies in the order in which we apply the expectation (over the environment parameter) and the maximization operator. It should also be noted that the expected utility of (a) is an upper bound on the expected utility of (b). Yet, both cases can constitute optimal decisions depending on constraints. In (a), the decisionmaker picks his action taking into account the uncertainty over the bias, while in (b), the decisionmaker picks his action only after his beliefs over the coin bias are instantiated—that is, he is optimal w.r.t. his random beliefs. Here we consider how this optimality w.r.t. random beliefs can be considered as a form of optimal decisionmaking under information processing constraints.
Modeling bounded rational decisionmaking
Here we consider a particular informationtheoretic model of bounded rational decisionmaking that formalizes limited information processing resources by a variational principle that trades off expected utility gains (or losses) and entropic information costs (Ortega [2011a]; Ortega and Braun [2011, 2012a, 2013]). Information processing costs are usually ignored in the study of perfectly rational decisionmakers. Given a choice set $\mathcal{X}$ with choices $x\in \mathcal{X}$ and utilities U(x), a perfectly rational decisionmaker would always choose the best option x^{∗}= arg maxx U(x)—presupposing there is a unique maximum. In general, a bounded rational decisionmaker is unable to pick out the best option with certainty, and his choice can be described by a probability distribution P(x) reflecting uncertainty. Improving the choice strategy P(x) can be understood as a costly search process.
Let us assume the initial strategy of the decisionmaker can be described by a probability distribution P_{0}(x). The search process for the optimum transforms this initial choice into a final choice P(x). In case of the perfectly rational decisionmaker the final choice is $P\left(x\right)={\delta}_{x,{x}^{\ast}}$. In the general case of the bounded rational decisionmaker the search is costly and he will not be able to afford such a stark reduction in uncertainty. Assuming that search costs are realvalued, additive and higher for rare events (Ortega and Braun [2010c]), it can be shown that the cost of the search is determined by the information distance D_{ K L } between P_{0} and P, that is ${D}_{\mathit{\text{KL}}}=\sum _{x}P\left(x\right)log\frac{P\left(x\right)}{{P}_{0}\left(x\right)}$. Both Bayesian search (Jaynes [1985]) and Koopman’s random search (Stone [1998]) are compatible with these assumptions, as well as energetic costs that would have to be paid by a Maxwellian demon for reducing uncertainty in statistical physical systems (Ortega and Braun [2013]). How this informationtheoretic model of search costs relates to computational resources such as space and time complexity is still an open problem (Vitanyi [2005]).
Simple decisions
where ${\mathcal{U}}_{max}$ is the uniform distribution over the maximizing subset ${\mathcal{X}}_{max}:=\{x\in \mathcal{X}:U(x)=\underset{{x}^{\prime}}{max}U({x}^{\prime}\left)\right\}$. It can be seen that a perfectly rational agent with α→∞ is able to pick out the optimal action—which is a deterministic policy in the case of a single optimum—, whereas finitely rational agents have stochastic policies with nonzero probability of picking a suboptimal action.
because $\sum _{x}p\left(x\right)U\left(x\right)\le T$, that is the exponential of the KullbackLeibler divergence provides a lower bound on the average number of samples.
Decisions in the presence of latent variables
To model a Thompson sampling agent, we need at least a twostep decision with a variable x that has to be chosen by the agent, and a variable θ that is chosen by the environment. In the example described in Figure 1, the variable x is the agent’s prediction for the outcome of a coin toss, the variable θ indicates nature’s choice which one of the two coins is tossed. The agent’s prediction can take on the values x=H and x=T corresponding to the outcomes Head and Tail. The variable θ takes on the two values $\theta =\frac{1}{4}$ and $\theta =\frac{3}{4}$ corresponding to the biases of the two coins. The prior probability over θ is ${p}_{0}\left(\theta =\frac{1}{4}\right)=\frac{1}{4}$ and ${p}_{0}\left(\theta =\frac{3}{4}\right)=\frac{3}{4}$. The expected rewards for all combinations of x and θ are then $U\left(x=H,\theta =\frac{1}{4}\right)=\frac{1}{4}$, $U\left(x=T,\theta =\frac{1}{4}\right)=\frac{3}{4}$, $U\left(x=H,\theta =\frac{3}{4}\right)=\frac{3}{4}$ and $U\left(x=T,\theta =\frac{3}{4}\right)=\frac{1}{4}$.
and the normalization constant ${Z}_{\mathrm{\alpha \beta}}=\sum _{x}{p}_{0}\left(x\right)exp\left(\frac{\alpha}{\beta}log{Z}_{\beta}\left(x\right)\right).$ From Equation (17) we can derive both the perfectly rational decisionmaker and the Thompson sampling agent. To simplify, we assume in the following that the agent has no prior preference for x, that is ${p}_{0}\left(x\right)=\mathcal{U}\left(x\right)$.
The perfectly rational expected utility maximizer as depicted in Figure 1a is then obtained from Equation (18) by taking the limit α→∞.
The resulting agent is a probabilistic superposition of agents that act optimally for any given θ as depicted in Figure 1b. It can be seen that in Equation (19) and in Equation (18) the order of the expectation operation and the (soft)maximization operation are reversed.
with $u\sim \mathcal{U}[\phantom{\rule{0.3em}{0ex}}0;1]$ and T as set above. Since α≫β we can assume α≈N β with $N\in \mathbb{N}$, and we can see easily that the perfectly rational agent will require infinitely many θ samples (α→∞ and β→0) to obtain one sample of x, whereas the Thompson sampling agent will only require a single sample (α=β). The Thompson sampling agent is therefore the agent that can solve the optimization problem of Equation (16) for a given α with the least amount of samples. This can also be seen from Equation (18), when doing the Monte Carlo approximation $\sum _{\theta}{p}_{0}\left(\theta \right)U(x,\theta )\approx \frac{1}{N}\sum _{i}U(x,{\theta}_{i})$ by drawing N samples θ_{ i }∼p_{0}(θ_{ i }). For infinitely many samples, the average approximates the expectation, for a single sample we can rewrite Equation (18) into Equation (19). This sampling procedure also allows estimating the upper and lower bounds of the optimal utility (Tziortziotis et al. [2013]). Of course, the Thompson sampling agent will not achieve the same expected utility as the perfectly rational agent. But both agents can be considered optimal under particular information processing constraints.
Causal induction
A generalized Thompsonsampling agent can be thought of as a probabilistic superposition of models θ, where each model θ is characterized by a likelihood model P(o_{ t }θ,a_{≤t},o_{<t}) and a policy model P(a_{ t }θ,a_{<t},o_{<t}). In previous applications we assumed that all models θ have the same causal structure, i.e. considering multivariate random variables a_{ t } and o_{ t }, we assumed that the same variables a_{ t } are intervened for all θ and the same causal model is used to predict the consequences of these interventions on the observational variables o_{ t }. However, this need not be the case. In principle, different models θ could represent different causal structures and suggest intervention of different variables. Such a setup can be used for causal induction as illustrated in the following example.
Imagine we are working on a medical treatment that involves two gene sites X and Y, each of which can be active or inactive. We encode the 'on’ and 'off’ states of X as X=x and X=¬x and similarly Y=y and Y=¬y to denote the 'on’ and 'off’ states of Y. Assume we are unsure about the causal mechanism between the two variables, that is we are unsure whether the activity of X influences the activity of Y or the other way around. Formally, we are interested in the explanatory power of two competing causal hypotheses: either 'X causes Y’ (Θ=θ) or 'Y causes X’ (Θ=¬θ). Assume our goal is to have Y in an active state, but that it is much cheaper and easier to manipulate X instead of Y. This leaves us with the following policies. If X causes Y we prefer to manipulate X, because it is cheap and easy. If Y causes X we have no other choice, but to directly manipulate Y. When manipulating either gene, we can be 100% sure that the new state of the gene is set by us, but we only have a 50% chance that the state will be 'on’. Assume not manipulating either variable is not an option, because then both gene sites stay inactive. The question is how should we act if we do not know the causal dependency?
 1.
Causal precedence: A node causally precedes its descendants. For instance, the root node corresponding to the sure event Ω causally precedes all other nodes.
 2.
Resolution of variables: Each node resolves the value of a random variable. For instance, given the node corresponding to Θ=θ and X=¬x, either Y=y will happen with probability $P\left(y\right\theta ,\neg x)=\frac{1}{4}$ or Y=¬y with probability $P(\neg y\theta ,\neg x)=\frac{3}{4}$.
 3.
Heterogeneous order: The resolution order of random variables can vary across different branches. For instance, X precedes Y under Θ=θ, but Y precedes X under Θ=¬θ. This is precisely how we model competing causal hypotheses.
While the probability tree represents the agent’s subjective model explaining the order in which the random values are resolved, it does not necessarily correspond to the temporal order in which the events are revealed to us. So for instance, under hypothesis Θ=θ, the value of the variable Y might be revealed before X, even though X causally precedes Y; and the causal hypothesis Θ, which precedes both X and Y, is never observed.
 1.Manipulating X: First, the agent instantiates his random beliefs by sampling the value of Θ from the prior:$P(\Theta =\theta )=P(\Theta =\neg \theta )=\frac{1}{2}.$Assume that the result is θ. Treating θ as if it was the true parameter, he proceeds to sample the action from P(Xθ) given by$P(X=x\theta )=P(X=\neg x\left\theta \right)=\frac{1}{2},$as indicated in the left branch of the probability tree. Assume that outcome is x, and this is the action that the agent executes. Because of this, the agent has to update its beliefs first by intervening the probability tree for $\widehat{x}$ and second by conditioning on x. The intervention $\widehat{x}$ is carried out by replacing all the nodes in the tree that resolve the value of X with new nodes assigning probability one to x and zero to ¬x. Figure 2b illustrates the result of this intervention. The posterior is then given by$P\left(\theta \right\widehat{x})=\frac{P\left(\widehat{x}\right\theta \left)P\right(\theta )}{P\left(\widehat{x}\right\theta \left)P\right(\theta )+P(\widehat{x}\neg \theta )P(\neg \theta )}=\frac{1\xb7\frac{1}{2}}{1\xb7\frac{1}{2}+1\xb7\frac{1}{2}}=\frac{1}{2}.$
In other words, the agent has switched on X, and has so far learned nothing from this fact.
 2.Observing Y: Now, the agent observes the activity of Y, and assume that it is active, i.e. Y=y. Then, the posterior beliefs of the agent are given as$\begin{array}{ll}P\left(\theta \right\widehat{x},y)& =\frac{P\left(y\right\theta ,\widehat{x}\left)P\right(\widehat{x}\left\theta \right)P\left(\theta \right)}{P\left(y\right\theta ,\widehat{x}\left)P\right(\widehat{x}\left\theta \right)P\left(\theta \right)+P\left(\widehat{x}\right\neg \theta ,y\left)P\right(y\neg \theta )P(\neg \theta )}\phantom{\rule{2em}{0ex}}\\ =\frac{\frac{3}{4}\xb71\xb7\frac{1}{2}}{\frac{3}{4}\xb71\xb7\frac{1}{2}+1\xb7\frac{1}{2}\xb7\frac{1}{2}}=\frac{3}{5}.\phantom{\rule{2em}{0ex}}\end{array}$Since $P\left(\theta \right)<P\left(\theta \right\widehat{x},y)$, the agent has gathered evidence favoring the hypothesis “X causes Y”. This was only possible because the intervention introduced a statistical asymmetry among the two hypotheses that did not exist in the beginning. In comparison, if the action is not treated as an intervention, then the posterior is$\begin{array}{ll}P\left(\theta \rightx,y)& =\frac{P\left(y\right\theta ,x\left)P\right(x\left\theta \right)P\left(\theta \right)}{P\left(y\right\theta ,x\left)P\right(x\left\theta \right)P\left(\theta \right)+P\left(x\right\neg \theta ,y\left)P\right(y\neg \theta )P(\neg \theta )}\phantom{\rule{2em}{0ex}}\\ =\frac{\frac{3}{4}\xb7\frac{1}{2}\xb7\frac{1}{2}}{\frac{3}{4}\xb7\frac{1}{2}\xb7\frac{1}{2}+\frac{3}{4}\xb7\frac{1}{2}\xb7\frac{1}{2}}=\frac{1}{2}=P\left(\theta \right),\phantom{\rule{2em}{0ex}}\end{array}$
that is, the agent doesn’t learn anything just from observing. This also highlights the importance of interventions (Box [1966]).
Discussion
The main contribution of the present paper is to show in how far generalized Thompson sampling can be regarded as an optimal solution method for adaptive decisionmaking in the presence of informationprocessing constraints and how this framework can be extended to solve problems of causal induction. We previously proposed Equation (3) as a Bayesian rule for acting in (Ortega and Braun [2010a, 2010b]) that optimally solves the adaptive coding problem for actions and observations. In practice, it is implemented by sampling an environment parameter ${\widehat{\theta}}_{t}$ for each time step from the posterior distribution $P\left(\theta \right{\xe2}_{<t},{o}_{<t})$, and then treating it as if it was the true parameter—that is, issuing the action a_{ t } from $P\left({a}_{t}\right{\widehat{\theta}}_{t},{a}_{<t},{o}_{<t})$. This actionsampling method where beliefs are randomly instantiated was first proposed as a heuristic in (Thompson [1933]) and is now known as Thompson sampling. Importantly, this method can be generalized and applied to solve general sequential adaptive decisionmaking problems.
So far Thompson sampling has been mainly applied to multiarmed bandit problems. Multiarmed bandits can be represented by a parameter θ that summarizes the statistical properties of the reward obtained for each lever. Reward distributions range from Bernoulli to Gaussian (with unknown mean and variance), and they can also depend on the particular context or state (Graepel et al. [2010]; May and Leslie [2011]; Granmo [2010]; Scott [2010]). In particular, the work of (May and Leslie [2011]) and the work of (Granmo [2010]) prove asymptotic convergence of Thompson sampling. The performance of bandit algorithms has also been studied in terms of the rate of growth of the regret (Lai and Robbins [1995]), and recent bandit algorithms have been shown to match this lower bound (Cappé et al. [2013]), including Thompson sampling algorithms for Bernoulli bandits (Kaufmann et al. [2012]). Also, the work of (Chapelle and Li [2011]) presents empirical results that show Thompson sampling is highly competitive, matching or outperforming popular methods such as UCB (Lai and Robbins [1995]; Auer et al. [2002]).
Another class of problems, where Thompson sampling has been applied in the past, are Markov decision processes (MDPs). MDPs can be represented by parameterizing the dynamics and reward distribution (modelbased) (Strens [2000]) or by directly parameterizing the Qtable (modelfree) (Dearden et al. [1998]; Ortega and Braun [2010a]). The first approach samples a full description of an MDP, solves it for the optimal policy, and then issues the optimal action. This is repeated in each time step. The second approach avoids the computational overhead of solving for the optimal policy in each time step by directly doing inference on the Qtables. Actions are chosen by picking the one having the highest Qvalue for the current state. The same ideas can also be applied to solve adaptive control problems with linear system equations, quadratic cost functions and Gaussian noise (Braun and Ortega [2010]).
Optimality
 1.
Computational complexity. The computations required to find the optimal solution (for instance, the computational complexity of solving the Bellman optimality equations) are prohibitive in general and scale exponentially with the length of the horizon. The problem is tractable only in very special cases under assumptions that reduce the effective size of the problem.
 2.
Causal precedence of policy choice. The choice of the policy has to be made before the interaction with the environment starts. That is, an agent has to have a unique optimal policy before it has even interacted once with the environment. An optimal policy constructed by the maximum expected utility principle is therefore a very risky bet, as a lot of resources have to be spent before any evidence exists that the underlying model or prior is adequate.
Because of these two reasons, it is practically often impossible to apply the maximum expected utility principle. This has led to the development of theories of bounded rational decisionmaking that take the information processing limitations of decisionmakers into account. The modern study of bounded rationality was famously broached by Simon ([1956, 1972, 1984]) and has since been extensively investigated in psychology (Gigerenzer and Selten [2001]; Camerer [2003]), cognitive science (Howes et al. [2009]; Janssen et al. [2011]; Lewis et al.), economics (Aumann [1997]; Rubinstein [1998]; Kahneman [2003]), game theory (McKelvey and Palfrey [1995, 1998]; Wolpert [2004]), political science (Jones [2003]), industrial organization (Spiegler [2011]), computer science and artificial intelligence research (Lipman [1995]; Russell [1995]; Russell and Subramanian [1995]). Different conceptions of bounded rationality are divided as to whether bounded rational behavior is thought to be fundamentally nonoptimizing or whether it can be expressed as a (constrained) optimization problem and as to whether it involves any kind of metareasoning (Klein [2001]). While the variational formulation in the free energy can also be thought of as a constrained optimization problem, this optimization is only implicit in an agent that runs an anytime algorithm to obtain samples that directly optimize the original (unconstrained) utility function. The average number of samples that can be afforded is determined by an inverse temperature parameter, such that the search for the optimum is aborted after some time, thereby generating some kind of satisficing solution. The free energy formulation of bounded rationality also allows reinterpreting a wider research program that has investigated relative entropy as a particular cost function for control (Kappen [2005]; Todorov [2006, 2009]; Theodorou et al. [2010]; Peters et al. [2010]; Braun and Ortega [2011]; Kappen et al. [2012]) and has inspired the formulation of optimal control problems as inference problems (Tishby and Polani [2011]; Kappen et al. [2012]; Rawlik et al. [2012]). In Section “Decisionmaking with limited resources” we have argued that Thompson sampling can be regarded as an instantiation of free energy optimizing bounded rationality requiring the minimal amount of samples of the latent variable θ in the decisionmaking process determining the next action. An agent that follows such a Thompson sampling strategy randomly samples beliefs θ and acts optimally with respect to these random beliefs. In contrast, a perfectly rational agent optimizes his utility over the entire belief tree.
Policy Uncertainty. Given a problem specification in terms of the predictive model and the utility function, we can think about policy uncertainty in terms of policy search methods. The task of a policy search method is to calculate a policy that approximates the optimal policy. More specifically, let π be a parameter in a set Π indexing the set of candidate policies P(a_{ t }π,a_{1:t1},o_{1:t1}) indexed by θ∈Θ. Then, in the most general case, a policy search method returns a probability distribution P(π) over Π representing the uncertainty over the optimal policy parameters. If the algorithm solves the maximum expected utility problem, then the support of this distribution will exclusively cover the set of optimal policies Π^{∗}⊂Π. Otherwise there remains uncertainty over the optimal policy parameters. However, many policy search methods do not explicitly deal with the uncertainty over the policy parameters. Some methods only return a point estimate $\widehat{\pi}\in \Pi $. For instance, reinforcement learning algorithms (Sutton and Barto [1998]) start from a randomly initialized point estimate ${\widehat{\pi}}_{0}$ of the optimal policy and then generate refined point estimates ${\widehat{\pi}}_{1},{\widehat{\pi}}_{2},{\widehat{\pi}}_{3},\dots $ in each time step t=1,2,3,… using the data provided by experience. In order to converge to the optimal policy, these algorithms have to deal with the explorationexploitation tradeoff. This means that the agents cannot just greedily act according to these point estimates; instead, they have to produce explorative actions as well, that is, actions that deviate from the current estimate of the optimal policy—for instance producing optimistic actions based on UCB (Lai and Robbins [1995]; Auer et al. [2002]).
Crucially, when sampling actions from the predictive distribution, the policy index π is identical to the index θ that identifies a particular environment with the likelihood model P(o_{ t }a_{1:t1},o_{1:t1}). By turning the reinforcement learning problem thus into an inference problem, the explorationexploitation tradeoff becomes a biasvariance tradeoff (Geman et al. [1992]) in policy space. This highlights the essence of the explorationexploitation tradeoff: any action issued by the agent has to respect the uncertainty over the policy parameter—otherwise they are biased. In particular, if the agent acts deterministically and greedily (i.e. it treats the estimate $\widehat{\pi}$ as if it were the true policy parameter) then it is overfitting the experience and introducing a bias; likewise, an agent that follows a stochastic policy introduces variance and will not produce the highest possible reward compared to the case when the optimal policy is known. An excessively stochastic agent therefore underfits its experience.
The operational distinction of having policy uncertainty has important algorithmic consequences. When there is policy uncertainty, the belief of the decisionmaker is itself a random variable. This means that the very policy is undefined until the random variable is resolved. Hence, the computation of the optimal policy can be delayed and determined dynamically. It is precisely this fact that is (implicitly) exploited in popular reinforcement learning algorithms, and explicitly in the algorithms based on random beliefs. This is in stark contrast to the case when there is no policy uncertainty, where the policy is precomputed and static. Another example where random beliefs play a crucial role is in games with incomplete information (Osborne and Rubinstein [1999]). Here, having incomplete information about the other player leads to a infinite hierarchy of metareasoning about the other player’s strategy. To avoid this difficulty, Harsanyi introduced Bayesian games (Harsanyi [1967]). In a Bayesian game, incomplete knowledge is modeled by randomly instantiating the player’s types, after which they choose their strategies optimally—thus eliminating the need for recurrent reasoning about the other players’ strategy. Similarly, a Thompson sampling agent randomly instantiates his belief at every point in time and acts optimally with respect to this belief. An important consequence of this is that agents have uncertainty about their policy.
In the case of observations, this is a wellknown variational principle for Bayesian inference, as it describes a predictor that requires, on average, the least amount of extra bits to capture informational surprise stemming from the behavior of the environment. In the case of actions, the same principle can be harnessed to describe resourceful generation of actions in a way that requires random bits with minimum length on average, when trying to match the optimal policy most suitable for the unknown environment (MacKay [2003]). When thinking about the adaptive control problem in this way, the aim of the adaptive agent is simply to avoid surprise. The fact that each custombuilt policy $P\left({a}_{t}\right\theta ,{\xe2}_{<t},{o}_{<t})$ can be thought of as maximizing a utility in environment Q_{ θ } is not crucial, as this policy could also be given by a teacher’s demonstration in the absence of an explicitly stated utility function. The avoidance of surprise of adaptive systems has recently been discussed in the context of active inference and the free energy principle (Friston [2009, 2010]).
Causality
In Section “Causal induction”, we could demonstrate that generalized Thompson sampling can also be applied to the problem of causal induction, by designing policy and prediction models with different causal structures. This way generalized Thompson sampling can be used as a general method for causal induction that is Bayesian in nature. It is based on the idea of combining probability trees (Shafer [1996]) with interventions (Pearl [2000]) for predicting the behavior of a manipulated system with multiple causal hypotheses. Both the interventions and the constraints on the causal hypotheses introduce statistical asymmetries that permit the extraction of causal information. Unlike frameworks that aim to extract causal information from observational data alone (Shimizu et al. [2006]; Griffiths and Tenenbaum [2009]; Janzing and Schölkopf [2010]), the proposed method is designed for agents that interact with their environment and use these interactions to discover causal relationships.
i.e. the actions have to be issued deterministically (but possibly historydependent) from a unique policy. Intuitively speaking, this is because the operations of intervening and conditioning coincide when the random variables are deterministic.
Convergence
There are important cases where random belief approaches can fail. Indeed, it is easy to devise experiments where having policy uncertainty converges exponentially slower (or does not converge at all) than the Bayes adaptive optimal policy. Consider, for example, two korder Markov chains with only one observable state when applying k times the same action, but we do not know which action it is. For two possible actions and a uniform prior over the two possible environments the distribution over possible worlds stays uniform as long as no reward has been observed. Choosing actions randomly according to this distribution would require 2^{ k } actions to accidentally choose a sequence of the same action of length k. Thus, the Bayes adaptive optimal policy converges in time k, while the agent with policy uncertainty needs exponentially longer. A simple way to remedy this problem is, of course, to sample random beliefs only every k time steps (Strens [2000]). But this problem can be exacerbated in nonstationary environments. Take for instance, an increasing MDP with two actions and number of states $k=\lceil 10\sqrt{t}\phantom{\rule{0.3em}{0ex}}\rceil $, in which the optimal policy converges in 100 steps, while an agent with policy uncertainty would not converge at all in most realizations. Although (Ortega and Braun [2010b]) prove asymptotic convergence for general environments fulfilling a restrictive form of ergodicity condition, this condition needs to be weakened for the convergence proof to be applicable to most real problems. But it is clear that a form of ergodicity is required for an agent with policy uncertainty to be able to learn to act optimally. Intuitively, this means that an agent can only learn if the environment has temporally stable statistical properties.
Conclusion
In this paper we have argued that Thompson sampling is a bounded rational strategy in decisionmaking that can be considered optimal under given information processing constraints. Thompson sampling agents have uncertainty over their policy, which is a natural phenomenon that arises whenever there are not enough computational resources to apply the maximum expected utility principle to single out a unique optimal policy. Having policy uncertainty effectively weakens the two assumptions of the maximum expected utility principle: the optimal policy can be chosen and refined during interactions, and the computational complexity is lower. We have shown that treating this uncertainty in a Bayesian way with actions as random variables that obey causal calculus naturally leads to Thompson sampling and its Bayesian generalization. This generalized Thompson sampling can be straightforwardly applied to the problem of causal induction. Maintaining and updating Bayesian probabilities is an optimally efficient way to deal with uncertainty—be it with respect to the policy or the environment (Ortega and Braun [2010a]). As these randombelief approaches can be derived simply from probability theory and causal calculus we suggest that they cannot only be regarded as heuristic approximations to optimal decisionmaking, but as principled solution methods in their own right.
Endnote
^{a} Each custombuilt policy P(a_{ t }θ,a_{<t},o_{<t}) can be thought to maximize a utility function in its environment θ, but this is not essential—the policy could also just be given by a teacher’s demonstration as in imitation learning (Schaal [1999]).
Notes
Declarations
Acknowledgements
This study was supported by the DFG, Emmy Noether grant BR4164/11.
Authors’ Affiliations
References
 Agrawal S, Goyal N: Analysis of Thompson sampling for the multiarmed bandit problem. JMLR: Workshop and Conference Proceedings vol 23 (2012) 39.1–39.26. 25th Annual Conference on Learning Theory 2011.Google Scholar
 Asmuth J, Li L, Littman ML, Nouri A, Wingate D: A Bayesian s+ in reinforcement learning. In Proceedings of the TwentyFifth Conference on Uncertainty in Artificial Intelligence. UAI ’09, Arlington, Virginia, United States: AUAI Press; 2009:19–26.Google Scholar
 Auer P, CesaBianchi N, Fisher P: Finitetime analysis of the multiarmed bandit problem. Machine Learning 2002, 47: 235–256. 10.1023/A:1013689704352View ArticleMATHGoogle Scholar
 Aumann RJ: Rationality and bounded rationality. Games and Econ Behavior 1997, 21(1–2):2–14. 10.1006/game.1997.0585MathSciNetView ArticleMATHGoogle Scholar
 Box G: Use and abuse of regression. Technometrics 1966, 8(4):625–629. 10.2307/1266635MathSciNetView ArticleMATHGoogle Scholar
 Braun DA, Ortega PA: A minimum relative entropy principle for adaptive control in linear quadratic regulators. The 7th Conference on Informatics in Control, Automation and Robotics, Volume 3 2010, 103–108.Google Scholar
 Braun DA, Ortega PA, Theodorou E, Schaal S: Path integral control and bounded rationality. IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning 2011, 202–209.Google Scholar
 Bubeck S, Liu CY: A note on the Bayesian regret of Thompson sampling with an arbitrary prior. 2013. arXiv:1304.5758Google Scholar
 Camerer C: Behavioral Game Theory: Experiments in Strategic Interaction. Princeton: Princeton University Press; 2003.MATHGoogle Scholar
 Cao F, Ray S: Bayesian hierarchical reinforcement learning. Neural Information Processing Systems 25 (NIPS) 2012.Google Scholar
 Cappé O, Garivier A, Maillard OA, Munos R, Stoltz G: KullbackLeibler upper confidence bounds for optimal sequential allocation. Ann Stat 2013, 41(3):1516–1541. 10.1214/13AOS1119MathSciNetView ArticleMATHGoogle Scholar
 Chapelle O, Li L: An empirical evaluation of Thompson sampling. NIPS 2011, 2249–2257.Google Scholar
 Dearden R, Friedman N, Russell S: Bayesian Qlearning. In AAAI ’98/IAAI ’98: Proceedings of the fifteenth national/tenth conference on Artificial intelligence/Innovative applications of artificial intelligence. Menlo Park, CA, US: American Association for Artificial Intelligence; 1998:761–768.Google Scholar
 Dimitrakakis C: MonteCarlo utility estimates for Bayesian reinforcement learning. IEEE Conference on Decision and Control 2013.Google Scholar
 Dimitrakakis C, Tziortziotis N: ABC reinforcement learning. Proceedings of The 30th International Conference on Machine Learning 2013, 684–692.Google Scholar
 Duff M: Optimal learning: computational procedures for bayesadaptive markov decision processes. PhD thesis. 2002. [DirectorAndrew Barto] [DirectorAndrew Barto]Google Scholar
 Friston K: The freeenergy principle: a rough guide to the brain? Trends in Cognitive Science 2009, 13: 293–301. 10.1016/j.tics.2009.04.005View ArticleGoogle Scholar
 Friston K: The freeenergy principle: a unified brain theory? Nat Rev Neurosci 2010, 11: 127–138. 10.1038/nrn2787View ArticleGoogle Scholar
 Geman S, Bienenstock E, Doursat R: Neural networks and the bias/variance dilemma. Neural Comput 1992, 4: 1–58. 10.1162/neco.1992.4.1.1View ArticleGoogle Scholar
 Gigerenzer G, Selten R: Bounded Rationality: The Adaptive Toolbox. Cambridge, MA: MIT Press; 2001.Google Scholar
 Gittins J: Bandit processes and dynamic allocation indices. J R Stat Soc Ser B, Methodological 1979, 41: 148–177.MathSciNetMATHGoogle Scholar
 Glymour C, Spirtes P, Scheines R: Causation, Prediction, and Search, 2nd edition. Cambridge, Massachusetts, USA: MIT Press; 2000.MATHGoogle Scholar
 Graepel T, Quiñonero Candela J, Borchert T, Herbrich R: Webscale Bayesian clickthrough rate prediction for sponsored search advertising in Microsoft’s Bing search engine. Proceedings of the TwentySeventh International Conference on Machine Learning 2010, 25–26.Google Scholar
 Granmo OC: A Bayesian learning automaton for solving twoarmed bernoulli bandit problems. Proceedings of the 2008 Seventh International Conference on Machine Learning and Applications: ICMLA ’08 2008, 23–30.View ArticleGoogle Scholar
 Granmo OC: Solving twoarmed Bernoulli bandit problems using a Bayesian learning automaton. Int J Intell Comput Cybernetics 2010, 3(2):207–234. 10.1108/17563781011049179MathSciNetView ArticleMATHGoogle Scholar
 Granmo OC, Glimsdal S: Accelerated Bayesian learning for decentralized twoarmed bandit based decision making with applications to the Goore game. Applied intelligence 2013, 38(4):479–488. 10.1007/s104890120346zView ArticleGoogle Scholar
 Griffiths TL, Tenenbaum JB: Theorybased causal induction. Psychological Rev 2009, 116: 661–716.View ArticleGoogle Scholar
 Harsanyi J: Games with incomplete information played by “Bayesian” players. Management Sci 1967, 14(3):159–182. 10.1287/mnsc.14.3.159MathSciNetView ArticleMATHGoogle Scholar
 Heckerman D, Meek C, Cooper G: A Bayesian approach to causal discovery. Computation, causation, and discovery 1999, 19: 141–166.MathSciNetGoogle Scholar
 Howes A, Lewis RL, Vera A: Rational adaptation under task and processing constraints: implications for testing theories of cognition and action. Psychological Rev 2009, 116(4):717–751.View ArticleGoogle Scholar
 Hutter M: Universal Artificial Intelligence: Sequential Decisions based on Algorithmic Probability. Berlin: Springer; 2004.MATHGoogle Scholar
 Janssen CP, Brumby DP, Dowell J, Chater N, Howes A: Identifying optimum performance tradeoffs using a cognitively bounded rational analysis model of discretionary task interleaving. Topics in Cognitive Sci 2011, 3: 123–139. 10.1111/j.17568765.2010.01125.xView ArticleGoogle Scholar
 Janzing D, Schölkopf B: Causal inference using the algorithmic Markov condition. IEEE Trans Inf Theor 2010, 56(10):5168–5194.MathSciNetView ArticleGoogle Scholar
 Jaynes E: Entropy and search theory. In Maximum entropy and Bayesian methods in inverse problems. Heidelberg: SpringerVerlag; 1985.Google Scholar
 Jones BD: Bounded rationality and political science: lessons from public administration and public policy. J Public Administration Res Theory 2003, 13(4):395–412. 10.1093/jopart/mug028View ArticleGoogle Scholar
 Kahneman D: Maps of bounded rationality: psychology for behavioral economics. Am Econ Rev 2003, 93(5):1449–1475. 10.1257/000282803322655392View ArticleGoogle Scholar
 Kappen H: A linear theory for control of nonlinear stochastic systems. Phys Rev Lett 2005, 95: 200201.MathSciNetView ArticleGoogle Scholar
 Kappen H, Gómez V, Opper M: Optimal control as a graphical model inference problem. Machine Learn 2012, 1: 1–11.MathSciNetMATHGoogle Scholar
 Kaufmann E, Korda N, Munos R: Thompson sampling: an asymptotically optimal finitetime analysis. In ALT, Volume 7568 of, Lecture Notes in Computer Science. Edited by: Bshouty NH, Stoltz G, Vayatis N, Zeugmann T. Heidelberg, Germany: Springer; 2012:199–213.Google Scholar
 Keller G: Equilibrium States in Ergodic Theory. London Mathematical Society Student Texts: Cambridge Univeristy Press; 1998.View ArticleMATHGoogle Scholar
 Klein G: The fiction of optimization. In Bounded rationality: The adaptive toolbox. Edited by: Gigerenzer G, Selten R. Cambridge, Massachusetts, USA: MIT Press; 2001.Google Scholar
 Korda N, Kaufmann E, Munos R: Thompson sampling for 1dimensional exponential family bandits. Advances in Neural Information Processing Systems 2013, 1448–1456.Google Scholar
 Lai T, Robbins H: Asymptotically efficient adaptive allocation rules. Adv Appl Math 1995, 6: 4–22.MathSciNetView ArticleMATHGoogle Scholar
 Legg S: Machine super intelligence. PhD thesis, Department of Informatics, University of Lugano 2008.Google Scholar
 Lewis R, Howes A, Singh S: Computational rationality: linking mechanism and behavior through bounded utility maximization. Topics in Cognitive Science 2014, (in press) (in press)Google Scholar
 Lipman B: Information processing and bounded rationality: a survey. Canadian J Econ 1995, 28: 42–67. 10.2307/136022View ArticleGoogle Scholar
 MacKay D: Information Theory, Inference, and Learning Algorithms. Cambridge, UK: Cambridge University Press; 2003.MATHGoogle Scholar
 Martin J: Bayesian Decision Problems and Markov Chains. Publications in Operations Research, Wiley; 1967.MATHGoogle Scholar
 May B, Leslie D: Simulation studies in optimistic Bayesian sampling in contextualbandit problems. In Technical Report 11:02. Statistics Group, Department of Mathematics. Bristol, UK: University of Bristol; 2011.Google Scholar
 May BC, Korda N, Lee A, Leslie DS: Optimistic Bayesian sampling in contextualbandit problems. J Mach Learn Res 2012, 98888: 2069–2106.MathSciNetMATHGoogle Scholar
 Mckelvey R, Palfrey TR: Quantal response equilibria for extensive form games. Experimental Econ 1998, 1: 9–41.View ArticleMATHGoogle Scholar
 McKelvey RD, Palfrey TR: Quantal response equilibria for normal form games. Games and Econ Behavior 1995, 10: 6–38. 10.1006/game.1995.1023MathSciNetView ArticleMATHGoogle Scholar
 Mellor J, Shapiro J: Thompson sampling in switching environments with Bayesian online change point detection. 2013. arXiv:1302.3721Google Scholar
 Ortega PA: A unified framework for resourcebounded autonomous agents interacting with unknown environments. PhD thesis, Department of Engineering, University of Cambridge, UK 2011a.Google Scholar
 Ortega PA: Bayesian causal induction. NIPS Workshop on Philosophy and Machine Learning, Granada 2011.Google Scholar
 Ortega PA, Braun DA: A Bayesian rule for adaptive control based on causal interventions. In Proceedings of the third conference on general artificial intelligence. Paris, France: Atlantis Press; 2010a.Google Scholar
 Ortega PA, Braun DA: A minimum relative entropy principle for learning and acting. J Artif Intell Res 2010b, 38: 475–511.MathSciNetGoogle Scholar
 Ortega PA, Braun DA: A conversion between utility and information. In Proceedings of the Third Conference on Artificial General Intelligence. Paris, France: Atlantis Press; 2010c:115–120.Google Scholar
 Ortega PA, Braun DA: Information, utility and bounded rationality. In Lecture notes on artificial intelligence, Volume 6830. Heidelberg, Germany: SpringerVerlag; 2011:269–274.Google Scholar
 Ortega PA, Braun DA: Free energy and the generalized optimality equations for sequential decision making. In European Workshop for Reinforcement Learning. Edinburgh, UK; 2012a.Google Scholar
 Ortega PA, Braun DA: Adaptive coding of actions and observations. NIPS Workshop on Information in Perception and Action 2012b.Google Scholar
 Ortega PA, Braun DA: Thermodynamics as a theory of decisionmaking with informationprocessing costs. Proc R Soc A: Mathematical, Physical and Engineering Science 2013, 469: 2153.MathSciNetView ArticleGoogle Scholar
 Osband I, Russo D, Roy BV: (More) efficient reinforcement learning via posterior sampling. Advances in Neural Information Processing Systems 2013, 3003–3011.Google Scholar
 Osborne MJ, Rubinstein A: A Course in Game Theory. Cambridge, Massachusetts, USA: MIT Press; 1999.MATHGoogle Scholar
 Pearl J: Causality: Models, Reasoning, and Inference. Cambridge, UK: Cambridge University Press; 2000.MATHGoogle Scholar
 Peters J, Mülling K, Altun Y: Relative entropy policy search. AAAI 2010.Google Scholar
 Rawlik K, Toussaint M, Vijayakumar S: On stochastic optimal control and reinforcement learning by approximate inference. In Proceedings of Robotics: Science and Systems. Sydney, Australia: ; 2012.Google Scholar
 Rubinstein A: Modeling Bounded Rationality. Cambridge, Massachusetts, USA: MIT Press; 1998.Google Scholar
 Russell S: Rationality and Intelligence. In Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence. Edited by: Mellish C. Englewood Cliffs, New Jersey, USA: PrenticeHall; 1995:950–957.Google Scholar
 Russell S, Norvig P: Artificial Intelligence: A Modern Approach, 1st edition. PrenticeHall: Englewood Cliffs, NJ; 1995.MATHGoogle Scholar
 Russell S, Subramanian D: Provably boundedoptimal agents. J Artif Intell Res 1995, 3: 575–609.MATHGoogle Scholar
 Russo D, Roy BV: Learning to optimize via posterior sampling. 2013. arXiv:abs/1301.2609Google Scholar
 Schaal S: Is imitation learning the route to humanoid robots? Trends in cognitive sciences 1999, 3(6):233–242. 10.1016/S13646613(99)013273View ArticleGoogle Scholar
 Scott S: A modern Bayesian look at the multiarmed bandit. Applied Stochastic Models in Business and Industry 2010, 26: 639–658. 10.1002/asmb.874MathSciNetView ArticleGoogle Scholar
 Shafer G: The Art of Causal Conjecture. Cambridge, Massachusetts, USA: MIT Press; 1996.MATHGoogle Scholar
 Shimizu S, Hoyer PO, Hyvärinen A, Kerminen A: A Linear NonGaussian Acyclic Model for Causal Discovery. J Mach Learn Res 2006, 7: 2003–2030.MathSciNetMATHGoogle Scholar
 Simon HA: Rational choice and the structure of the environment. Psychological Rev 1956, 63(2):129–138.View ArticleGoogle Scholar
 Simon HA: Theories of bounded rationality. In Decision and Organization. Edited by: McGuire CB, Radner R. Amsterdam: NorthHolland Publishing; 1972:161–176.Google Scholar
 Simon H A: Models of Bounded Rationality. Cambridge. Cambridge, Massachusetts, USA: MIT Press; 1984.Google Scholar
 Spiegler R: Bounded Rationality and Industrial Organization. Oxford: Oxford University Press; 2011.View ArticleGoogle Scholar
 Stone L: Theory of Optimal Search. New York: Academic Press; 1998.Google Scholar
 Strens M: A Bayesian framework for reinforcement learning. Proceedings of the Seventeenth International Conference on Machine Learning 2000.Google Scholar
 Sutton R, Barto A: Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press; 1998.Google Scholar
 Theodorou E, Buchli J, Schaal S: A generalized path integral approach to reinforcement learning. J Mach Learn Res 2010, 11: 3137–3181.MathSciNetMATHGoogle Scholar
 Thompson WR: On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 1933, 25(3/4):285–294. 10.2307/2332286View ArticleMATHGoogle Scholar
 Tishby N, Polani D: Information theory of decisions and actions. In Perceptionreasonaction cycle: Models, algorithms and systems. Edited by: Vassilis T Hussain, Vassilis T Hussain . Heidelberg: SpringerVerlag; 2011:601–636.View ArticleGoogle Scholar
 Todorov E: Linearly solvable Markov decision problems. Advances in Neural Information Processing Systems, Volume 19 2006, 1369–1376.Google Scholar
 Todorov E: Efficient computation of optimal actions. Proceedings of the National Academy of Sciences USA 2009, 106: 11478–11483. 10.1073/pnas.0710743106View ArticleMATHGoogle Scholar
 Tziortziotis N, Dimitrakakis C, Blekas K: Cover tree Bayesian reinforcement learning. 2013a. arXiv: 1305.1809Google Scholar
 Tziortziotis N, Dimitrakakis C, Blekas K: Linear Bayesian reinforcement learning. In Proceedings of the TwentyThird international joint conference on Artificial Intelligence. AAAI Press; 2013:1721–1728.Google Scholar
 Vitanyi P: Time, space, and energy in reversible computing. Proceedings of the 2nd ACM conference on Computing frontiers 2005, 435–444.Google Scholar
 Wolpert DH: Information theory  the bridge connecting bounded rational game theory and statistical physics. In Complex Engineering Systems. New York, USA: Perseus Books; 2004.Google Scholar
 Wyatt J: Exploration and inference in learning from reinforcement. PhD thesis, Department of Artificial Intelligence, University of Edinburgh 1997.Google Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited.