Generalized Thompson sampling for sequential decision-making and causal inference
© Ortega and Braun; licensee Springer. 2014
Received: 11 November 2013
Accepted: 17 February 2014
Published: 14 March 2014
The Erratum to this article has been published in Complex Adaptive Systems Modeling 2014 2:4
Sampling an action according to the probability that the action is believed to be the optimal one is sometimes called Thompson sampling.
Although mostly applied to bandit problems, Thompson sampling can also be used to solve sequential adaptive control problems, when the optimal policy is known for each possible environment. The predictive distribution over actions can then be constructed by a Bayesian superposition of the policies weighted by their posterior probability of being optimal.
Here we discuss two important features of this approach. First, we show to what extent such generalized Thompson sampling can be regarded as an optimal strategy under limited information processing capabilities that constrain the sampling complexity of the decision-making process. Second, we show how such Thompson sampling can be extended to solve causal inference problems when interacting with an environment in a sequential fashion.
In summary, our results suggest that Thompson sampling might not merely be a useful heuristic, but a principled method to address problems of adaptive sequential decision-making and causal inference.
In a research paper from 1933, Thompson studied the problem of finding out which one of two drugs was better when testing them on a patient population under the constraint that as few people as possible should be subjected to the inferior drug during the course of testing (Thompson ). Given a current (Bayesian) probability estimate P of one treatment being better than the other, he suggested that it might be a good idea to adjust the proportions of future test subjects that take the two drugs to the respective probabilities P and 1-P. This way one would not run into the danger of permanently cutting off all future test subjects from a potentially superior treatment that so far seems inferior due to statistical fluctuations, while only temporarily risking exposure to a potentially inferior drug for a smaller proportion of the population. Randomizing actions based on the probability that this action is believed to be optimal when faced with an unknown environment is now sometimes called Thompson sampling.
Today, Thompson’s problem is generally thought of as a bandit problem that consists in determining which lever to pull at which point in time when facing a set of one-armed slot machines, each one having an unknown distribution over a reward variable (Russell and Norvig ; Sutton and Barto ). In the case of known prior probabilities and geometrically discounted future rewards, Gittins () provides an optimal policy for the bandit problem that maximizes the expected future cumulative discounted reward. In contrast, Thompson sampling is usually considered a heuristic approach to solving bandit problems (Wyatt ; Granmo [2008, 2010]; Asmuth et al. ; Graepel et al. ; Scott ; May and Leslie ; Chapelle and Li ; Agrawal and Goyal ; Granmo and Glimsdal ; May et al. ; Kaufmann et al. ; Russo and Roy ; Korda et al. ; Bubeck and Liu ). However, the basic idea of Thompson sampling (that is, sampling actions from a mixture distribution of policies according to their probability of being optimal) can also be applied to solve more general problems in sequential adaptive control (Dearden et al. ; Strens ; Ortega and Braun [2010a, 2010b, 2012a]; Braun and Ortega ; Osband and Russo ; Cao and Ray ; Tziortziotis et al. [2013a, 2013]; Dimitrakakis ; Dimitrakakis and Tziortziotis ; Mellor and Shapiro ).
Thompson sampling can be considered as an application of Bayes’ rule for acting where actions are treated as causally intervened random variables within the framework of statistical causality.
Thompson sampling can be considered as a form of optimal adaptive control under bounded rationality where limited information processing capabilities are modeled by entropic search costs.
Thompson sampling provides a natural strategy for causal induction when interacting with an environment with unknown causal structure.
Although the third section contains an algorithmic extension to previous work (Ortega and Braun [2010a]; [2010b]), it should be emphasized that the main contribution of the paper is not so much to present a novel algorithm, but to discuss basic properties of Thompson sampling, in particular how it relates to the information-theoretic bounded rationality model in (Ortega and Braun ), how this boundedness can be interpreted in terms of sampling complexity, and how this method can be applied to solve problems of causal inference.
The paper is structured as follows. In Section “Problem statement” we clarify the problem statement and recapitulate the main result of (Ortega and Braun [2010b]). In Section “Decision-making with limited resources” we analyze the decision-making problem faced by agents that are unable to compute the single best policy. In Section “Causal induction” we investigate how this approach can be applied to adaptive agents that need to discover the causal structure of their environment. Finally, we discuss the significance of these results in Section “Discussion”.
Problem statement
In an adaptive control problem a decision-maker faces an environment $Q_\theta$ drawn from a set of potential environments. In general $\theta$ could be a continuous variable, but we restrict our exposition to the discrete case. Each environment $Q_\theta$ can be characterized by a set of conditional distributions $Q(o_t \mid \theta, a_{\le t}, o_{<t})$ that indicate the probability of observing $o_t$ given past observations $o_{<t} = o_1 \ldots o_{t-1}$ and past actions $a_{\le t} = a_1 \ldots a_t$. This class of environments is very general, and it encompasses multi-armed bandits, (partially observable) Markov decision processes and others (compare Chapter 3 in Legg ). To allow for self-optimizing agents, the environment is typically assumed to be ergodic, so agents can recover from their mistakes (compare Section 3.5 in Legg ). The decision-maker has perfectly fitting prediction models $P(o_t \mid \theta, a_{\le t}, o_{<t}) = Q(o_t \mid \theta, a_{\le t}, o_{<t})$, but is uncertain about $\theta$. The uncertainty about $\theta$ can be represented by a prior distribution $P(\theta)$. The interaction proceeds as follows. First an environment $\theta$ is sampled from $P(\theta)$. The agent picks an action $a_0$ and receives an observation $o_0$, to which the agent responds with $a_1$ and receives observation $o_1$, and so on. The agent’s policy can be described by a set of conditional distributions $P(a_t \mid o_{<t}, a_{<t})$.
Problem statement: decision-theory
In order to solve the problem within the framework of maximum expected utility theory, one requires:
a prior $P(\theta)$ over possible environments $Q_\theta$,
a class of prediction models $P(o_t \mid \theta, a_{\le t}, o_{<t})$,
a utility function $U(o_{\le T}, a_{\le T})$.
maximizes the expected utility under the mixture distribution (Hutter ). Equations (1) and (2) define a Bayesian adaptive control problem (Martin ; Duff ). This problem formulation quickly becomes intractable, as the number of reachable information states grows exponentially with the time horizon (Duff ).
Problem statement: probability theory & statistical causality
Ignoring the notion of utility for a moment and treating actions purely as (causally intervened) random variables (Pearl ), one could think of another kind of adaptive control problem that is defined entirely in probabilistic and causal terms. This requires the following ingredients:
a prior $P(\theta)$ over possible environments $Q_\theta$,
a class of prediction models $P(o_t \mid \theta, a_{\le t}, o_{<t})$,
a class of policy models $P(a_t \mid \theta, a_{<t}, o_{<t})$.
As there can be only one action at any one time, single actions can be obtained as samples from the predictive distribution over actions. Importantly, sampling from this mixture is equivalent to first sampling a random belief $\theta$ from the posterior and then sampling an action from the policy model $P(a_t \mid \theta, a_{<t}, o_{<t})$ associated with that belief. This componentwise sampling from a mixture distribution is known as hierarchical sampling and corresponds here to a generalized Thompson sampling procedure, where first a random belief is sampled and then the policy associated with this belief is executed. If we now assume that each of the custom-built policies is optimal in its respective environment, we effectively select an action according to the probability that it is the optimal action, because we first sample the environment $\theta$ according to its posterior probability of being the true environment and then we perform the policy that is optimal in that environment. The question is how this problem formulation can be reconciled with a decision-theoretic problem statement that involves utilities. This is the topic of Section “Decision-making with limited resources”.
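The hierarchical sampling procedure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the two-hypothesis bandit, the names `REWARD_PROB`, `POLICIES`, `thompson_step` and `run`, and all numerical values are hypothetical. Each hypothesis comes with a custom-built policy that is optimal for it, and the belief is updated from observations only, since actions are set by the agent itself.

```python
import random

# Hypothetical two-armed bandit with two candidate environments.
REWARD_PROB = {
    "left_good": {"L": 0.8, "R": 0.2},
    "right_good": {"L": 0.2, "R": 0.8},
}
# Custom-built policy for each environment (optimal if that environment is true).
POLICIES = {
    "left_good": {"L": 1.0, "R": 0.0},
    "right_good": {"L": 0.0, "R": 1.0},
}

def sample_from(dist, rng):
    """Draw a key from a dict mapping outcomes to probabilities."""
    outcomes = list(dist)
    weights = [dist[o] for o in outcomes]
    return rng.choices(outcomes, weights=weights, k=1)[0]

def posterior_update(belief, likelihoods):
    """Bayes' rule over a discrete set of hypotheses."""
    unnorm = {th: belief[th] * likelihoods[th] for th in belief}
    z = sum(unnorm.values())
    return {th: p / z for th, p in unnorm.items()}

def thompson_step(belief, policies, rng):
    """Hierarchical sampling: first a belief theta, then an action from its policy."""
    theta = sample_from(belief, rng)
    action = sample_from(policies[theta], rng)
    return theta, action

def run(true_theta, steps, seed=0):
    """Interact with the true environment and return the final belief."""
    rng = random.Random(seed)
    belief = {"left_good": 0.5, "right_good": 0.5}
    for _ in range(steps):
        _, action = thompson_step(belief, POLICIES, rng)
        reward = 1 if rng.random() < REWARD_PROB[true_theta][action] else 0
        # Only the observation carries evidence; the action is an intervention.
        likelihoods = {
            th: REWARD_PROB[th][action] if reward else 1.0 - REWARD_PROB[th][action]
            for th in belief
        }
        belief = posterior_update(belief, likelihoods)
    return belief
```

Calling `run("right_good", 200)` returns a belief dictionary whose mass should concentrate on the true hypothesis as evidence accumulates.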
While both actions and observations are treated as random variables in (3), there is an important difference between actions and observations. Observations are produced by the environment and can be used to update the agent’s state of knowledge about the environment. In contrast, actions are set by the agent itself and hence they do not provide information about the environment. This distinction becomes crucial when conditioning on the history of actions and observations. The theory that deals with the distinction between exogenous and endogenous information is statistical causality (Pearl ; Glymour et al. ).
What is a causal intervention?
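Consider inferring the weather from a barometer reading with Bayes’ rule, $P(w \mid b) = P(b \mid w) P(w) / \sum_{w'} P(b \mid w') P(w')$, where $w$ and $b$ are the weather and the barometer variables respectively, $P(w)$ is the prior probability of the weather (e.g. good or bad) and $P(b \mid w)$ is the likelihood of the barometer change given the weather. The posterior $P(w \mid b)$ allows us to infer the weather from the barometer reading.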
Now, imagine you decide to change the level of the mercury yourself, say (using a bit of imagination) by means of a pressurizing device. Now, you set the value of the random variable—and intuition tells us that we cannot predict the weather anymore from the barometer reading. Apparently, our previous Bayesian model is useless now. This shouldn’t come as a surprise, as our intervention effectively changed the relation between the barometer and the weather.
In other words, we don’t gain knowledge about the weather, as expected. Notice that intervening the alternative factorization, $P(w \mid b) P(b)$, would give a different result that is inconsistent with our causal story: we have assumed that the mercury level of the barometer depends functionally on the weather, and not the other way around. The reason for this special treatment of actions is that when we set the value of a random variable ourselves, we change Nature’s probability law.
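The difference between observing and setting the barometer can be checked numerically. The sketch below assumes the causal factorization in which the barometer depends on the weather; the probability values and function names are hypothetical and chosen only for illustration. Conditioning uses the likelihood $P(b \mid w)$, whereas intervening replaces it with a point mass on the chosen reading for every weather state.

```python
def posterior_given_observation(p_w, p_b_given_w, b):
    """Observational posterior P(w | b) via Bayes' rule."""
    unnorm = {w: p_w[w] * p_b_given_w[w][b] for w in p_w}
    z = sum(unnorm.values())
    return {w: v / z for w, v in unnorm.items()}

def posterior_given_intervention(p_w, p_b_given_w, b):
    """Interventional posterior P(w | do(b)).

    Setting b by hand replaces P(b | w) with a point mass on the chosen
    value for every w, so the likelihood no longer depends on w.
    """
    intervened = {
        w: {value: (1.0 if value == b else 0.0) for value in p_b_given_w[w]}
        for w in p_w
    }
    return posterior_given_observation(p_w, intervened, b)

# Hypothetical numbers: prior over the weather and barometer likelihood.
p_w = {"good": 0.7, "bad": 0.3}
p_b_given_w = {
    "good": {"high": 0.9, "low": 0.1},
    "bad": {"high": 0.2, "low": 0.8},
}

observed = posterior_given_observation(p_w, p_b_given_w, "low")
forced = posterior_given_intervention(p_w, p_b_given_w, "low")
```

With these numbers, `observed` shifts belief toward bad weather, while `forced` equals the prior: a reading that we set ourselves carries no information about the weather.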
Causal interventions in Thompson sampling
where we first expand the probabilities in terms of the joint distribution, second rewrite the joint distribution as the causal factorization, third remove the intervention tags from the intervened random variables that are in the probability conditions (Pearl’s second rule of do-calculus (Pearl )), and fourth replace each conditional probability having an intervened variable in the argument by a delta function over its chosen value—compare Chapter 4.2 in (Pearl ).
obtained only by applying probability theory and causal calculus.
Decision-making with limited resources
respectively, where the labels (a) and (b) correspond to the labels in Figure 1. Here it is clearly seen that the difference between the two lies in the order in which we apply the expectation (over the environment parameter) and the maximization operator. It should also be noted that the expected utility of (a) is an upper bound on the expected utility of (b). Yet, both cases can constitute optimal decisions depending on constraints. In (a), the decision-maker picks his action taking into account the uncertainty over the bias, while in (b), the decision-maker picks his action only after his beliefs over the coin bias are instantiated—that is, he is optimal w.r.t. his random beliefs. Here we consider how this optimality w.r.t. random beliefs can be considered as a form of optimal decision-making under information processing constraints.
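The effect of swapping the expectation and the maximization can be illustrated with a small calculation. The sketch below uses hypothetical coin biases and a uniform prior, and it assumes that the utility of a prediction is the probability of it being correct and that the sampled belief in case (b) is drawn independently of the coin that is actually tossed. The function names are illustrative only.

```python
def expected_utility(x, prior, utility):
    """E_theta[U(x, theta)] under the prior over theta."""
    return sum(prior[theta] * utility[theta][x] for theta in prior)

def value_maximize_expectation(prior, utility, choices):
    """Case (a): pick the single x that maximizes the expected utility."""
    return max(expected_utility(x, prior, utility) for x in choices)

def value_random_beliefs(prior, utility, choices):
    """Case (b): sample a belief, act optimally for it, average over beliefs."""
    total = 0.0
    for belief, p_belief in prior.items():
        best_x = max(choices, key=lambda x: utility[belief][x])
        total += p_belief * expected_utility(best_x, prior, utility)
    return total

# Hypothetical example: two coins with P(Head) = 0.9 and P(Head) = 0.4.
choices = ["H", "T"]
prior = {"coin1": 0.5, "coin2": 0.5}
utility = {
    "coin1": {"H": 0.9, "T": 0.1},
    "coin2": {"H": 0.4, "T": 0.6},
}

value_a = value_maximize_expectation(prior, utility, choices)
value_b = value_random_beliefs(prior, utility, choices)
```

For these numbers `value_a` is 0.65 and `value_b` is 0.5, in line with the statement that the expected utility of (a) bounds that of (b) from above.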
Modeling bounded rational decision-making
Here we consider a particular information-theoretic model of bounded rational decision-making that formalizes limited information processing resources by a variational principle that trades off expected utility gains (or losses) and entropic information costs (Ortega [2011a]; Ortega and Braun [2011, 2012a, 2013]). Information processing costs are usually ignored in the study of perfectly rational decision-makers. Given a choice set with choices $x$ and utilities $U(x)$, a perfectly rational decision-maker would always choose the best option $x^* = \arg\max_x U(x)$, presupposing there is a unique maximum. In general, a bounded rational decision-maker is unable to pick out the best option with certainty, and his choice can be described by a probability distribution $P(x)$ reflecting uncertainty. Improving the choice strategy $P(x)$ can be understood as a costly search process.
Let us assume the initial strategy of the decision-maker can be described by a probability distribution $P_0(x)$. The search process for the optimum transforms this initial choice into a final choice $P(x)$. In the case of the perfectly rational decision-maker, the final choice places all probability mass on the optimum $x^*$. In the general case of the bounded rational decision-maker the search is costly and he will not be able to afford such a stark reduction in uncertainty. Assuming that search costs are real-valued, additive and higher for rare events (Ortega and Braun [2010c]), it can be shown that the cost of the search is determined by the information distance $D_{KL}$ between $P_0$ and $P$, that is, $D_{KL}(P \| P_0) = \sum_x P(x) \log \frac{P(x)}{P_0(x)}$. Both Bayesian search (Jaynes ) and Koopman’s random search (Stone ) are compatible with these assumptions, as well as energetic costs that would have to be paid by a Maxwellian demon for reducing uncertainty in statistical physical systems (Ortega and Braun ). How this information-theoretic model of search costs relates to computational resources such as space and time complexity is still an open problem (Vitanyi ).
where is the uniform distribution over the maximizing subset . It can be seen that a perfectly rational agent with α→∞ is able to pick out the optimal action—which is a deterministic policy in the case of a single optimum—, whereas finitely rational agents have stochastic policies with non-zero probability of picking a sub-optimal action.
because , that is the exponential of the Kullback-Leibler divergence provides a lower bound on the average number of samples.
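As a rough numerical illustration of the bounded rational choice rule, the sketch below assumes the usual closed-form solution of the free energy trade-off, $P(x) \propto P_0(x) \exp(\alpha U(x))$, which is not spelled out in the surrounding text and is therefore an assumption here. The choice set, the utilities and the function names are hypothetical. The code also reports $D_{KL}(P \| P_0)$ and its exponential, the quantity that the text relates to the average number of samples.

```python
import math

def bounded_rational_choice(p0, utility, alpha):
    """P(x) proportional to P0(x) * exp(alpha * U(x)) (assumed closed form)."""
    u_max = max(utility.values())  # shift utilities for numerical stability
    unnorm = {x: p0[x] * math.exp(alpha * (utility[x] - u_max)) for x in p0}
    z = sum(unnorm.values())
    return {x: v / z for x, v in unnorm.items()}

def kl_divergence(p, q):
    """D_KL(p || q) in nats, with the convention 0 * log 0 = 0."""
    return sum(p[x] * math.log(p[x] / q[x]) for x in p if p[x] > 0.0)

# Hypothetical choice set with a uniform initial strategy P0.
p0 = {"a": 0.25, "b": 0.25, "c": 0.25, "d": 0.25}
utility = {"a": 1.0, "b": 0.8, "c": 0.5, "d": 0.0}

results = []
for alpha in (0.0, 1.0, 10.0, 100.0):
    p = bounded_rational_choice(p0, utility, alpha)
    cost = kl_divergence(p, p0)
    results.append((alpha, p, cost, math.exp(cost)))
```

At `alpha = 0` the agent keeps its initial strategy at zero information cost; as `alpha` grows the distribution concentrates on the best option and the cost approaches `log(4)` for this four-element example.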
Decisions in the presence of latent variables
To model a Thompson sampling agent, we need at least a two-step decision with a variable x that has to be chosen by the agent, and a variable θ that is chosen by the environment. In the example described in Figure 1, the variable x is the agent’s prediction for the outcome of a coin toss, the variable θ indicates nature’s choice which one of the two coins is tossed. The agent’s prediction can take on the values x=H and x=T corresponding to the outcomes Head and Tail. The variable θ takes on the two values and corresponding to the biases of the two coins. The prior probability over θ is and . The expected rewards for all combinations of x and θ are then , , and .
and the normalization constant From Equation (17) we can derive both the perfectly rational decision-maker and the Thompson sampling agent. To simplify, we assume in the following that the agent has no prior preference for x, that is .
The resulting agent is a probabilistic superposition of agents that act optimally for any given θ as depicted in Figure 1b. It can be seen that in Equation (19) and in Equation (18) the order of the expectation operation and the (soft-)maximization operation is reversed.
with and T as set above. Since α≫β we can assume α≈N β with , and we can see easily that the perfectly rational agent will require infinitely many θ samples (α→∞ and β→0) to obtain one sample of x, whereas the Thompson sampling agent will only require a single sample (α=β). The Thompson sampling agent is therefore the agent that can solve the optimization problem of Equation (16) for a given α with the least amount of samples. This can also be seen from Equation (18), when doing the Monte Carlo approximation by drawing N samples θ i ∼p0(θ i ). For infinitely many samples, the average approximates the expectation, for a single sample we can rewrite Equation (18) into Equation (19). This sampling procedure also allows estimating the upper and lower bounds of the optimal utility (Tziortziotis et al. ). Of course, the Thompson sampling agent will not achieve the same expected utility as the perfectly rational agent. But both agents can be considered optimal under particular information processing constraints.
Causal induction
A generalized Thompson-sampling agent can be thought of as a probabilistic superposition of models $\theta$, where each model $\theta$ is characterized by a likelihood model $P(o_t \mid \theta, a_{\le t}, o_{<t})$ and a policy model $P(a_t \mid \theta, a_{<t}, o_{<t})$. In previous applications we assumed that all models $\theta$ have the same causal structure, i.e. considering multivariate random variables $a_t$ and $o_t$, we assumed that the same variables $a_t$ are intervened for all $\theta$ and the same causal model is used to predict the consequences of these interventions on the observational variables $o_t$. However, this need not be the case. In principle, different models $\theta$ could represent different causal structures and suggest intervention of different variables. Such a setup can be used for causal induction as illustrated in the following example.
Imagine we are working on a medical treatment that involves two gene sites X and Y, each of which can be active or inactive. We denote the ‘on’ and ‘off’ states of X by X=x and X=¬x, and similarly those of Y by Y=y and Y=¬y. Assume we are unsure about the causal mechanism between the two variables, that is, we are unsure whether the activity of X influences the activity of Y or the other way around. Formally, we are interested in the explanatory power of two competing causal hypotheses: either ‘X causes Y’ (Θ=θ) or ‘Y causes X’ (Θ=¬θ). Assume our goal is to have Y in an active state, but that it is much cheaper and easier to manipulate X instead of Y. This leaves us with the following policies. If X causes Y we prefer to manipulate X, because it is cheap and easy. If Y causes X we have no other choice but to directly manipulate Y. When manipulating either gene, we can be 100% sure that the new state of the gene is set by us, but we only have a 50% chance that the state will be ‘on’. Assume not manipulating either variable is not an option, because then both gene sites stay inactive. The question is how we should act if we do not know the causal dependency.
Causal precedence: A node causally precedes its descendants. For instance, the root node corresponding to the sure event Ω causally precedes all other nodes.
Resolution of variables: Each node resolves the value of a random variable. For instance, given the node corresponding to Θ=θ and X=¬x, either Y=y will happen with probability or Y=¬y with probability .
Heterogeneous order: The resolution order of random variables can vary across different branches. For instance, X precedes Y under Θ=θ, but Y precedes X under Θ=¬θ. This is precisely how we model competing causal hypotheses.
While the probability tree represents the agent’s subjective model explaining the order in which the random values are resolved, it does not necessarily correspond to the temporal order in which the events are revealed to us. So for instance, under hypothesis Θ=θ, the value of the variable Y might be revealed before X, even though X causally precedes Y; and the causal hypothesis Θ, which precedes both X and Y, is never observed.
- 1. Manipulating X: First, the agent instantiates his random beliefs by sampling the value of Θ from the prior. Assume that the result is θ. Treating θ as if it was the true parameter, he proceeds to sample the action from P(X|θ), as indicated in the left branch of the probability tree. Assume that the outcome is x, and this is the action that the agent executes. Because of this, the agent has to update his beliefs, first by intervening the probability tree to set X=x and second by conditioning on x. The intervention is carried out by replacing all the nodes in the tree that resolve the value of X with new nodes assigning probability one to x and zero to ¬x. Figure 2b illustrates the result of this intervention. The resulting posterior over Θ is equal to the prior.
In other words, the agent has switched on X, and has so far learned nothing from this fact.
- 2. Observing Y: Now, the agent observes the activity of Y; assume that it is active, i.e. Y=y. Since the resulting posterior probability of θ exceeds its prior probability, the agent has gathered evidence favoring the hypothesis “X causes Y”. This was only possible because the intervention introduced a statistical asymmetry between the two hypotheses that did not exist in the beginning. In comparison, if the action is not treated as an intervention, then the posterior remains equal to the prior,
that is, the agent doesn’t learn anything just from observing. This also highlights the importance of interventions (Box ).
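The two-step update in this example can be reproduced with a short calculation. The probabilities below are hypothetical stand-ins for the values in the probability tree, chosen so that both hypotheses imply the same joint distribution over X and Y; the names `P_ROOT`, `P_EFFECT`, `likelihood` and `posterior_x_causes_y` are likewise illustrative. The code contrasts treating the action X=x as an intervention with treating it as an ordinary observation.

```python
# Hypothetical parameters shared by both hypotheses:
# the root variable is 'on' with probability 0.5, and the dependent
# variable copies the state of its cause with probability 0.75.
P_ROOT = 0.5
P_EFFECT = {True: 0.75, False: 0.25}  # P(effect on | cause state)

def bern(p_on, value):
    """Probability of a binary value given the probability of 'on'."""
    return p_on if value else 1.0 - p_on

def likelihood(hypothesis, x, y, intervene_on_x):
    """Probability of (X=x, Y=y) under a causal hypothesis.

    With intervene_on_x=True, every node resolving X is replaced by a
    point mass on the chosen value, so X contributes a factor of one.
    """
    if hypothesis == "x_causes_y":
        p_x = 1.0 if intervene_on_x else bern(P_ROOT, x)
        return p_x * bern(P_EFFECT[x], y)
    # "y_causes_x": Y is resolved first, then X depends on Y.
    p_x = 1.0 if intervene_on_x else bern(P_EFFECT[y], x)
    return bern(P_ROOT, y) * p_x

def posterior_x_causes_y(prior, x, y, intervene_on_x):
    """Posterior probability of 'X causes Y' after seeing X=x and Y=y."""
    unnorm = {
        h: prior[h] * likelihood(h, x, y, intervene_on_x) for h in prior
    }
    return unnorm["x_causes_y"] / sum(unnorm.values())

prior = {"x_causes_y": 0.5, "y_causes_x": 0.5}
with_intervention = posterior_x_causes_y(prior, True, True, intervene_on_x=True)
without_intervention = posterior_x_causes_y(prior, True, True, intervene_on_x=False)
```

Under these assumed numbers the intervention raises the posterior of “X causes Y” to 0.6, while plain conditioning leaves it at the prior value of 0.5, because the two hypotheses are observationally indistinguishable.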
Discussion
The main contribution of the present paper is to show to what extent generalized Thompson sampling can be regarded as an optimal solution method for adaptive decision-making in the presence of information-processing constraints and how this framework can be extended to solve problems of causal induction. We previously proposed Equation (3) as a Bayesian rule for acting in (Ortega and Braun [2010a, 2010b]) that optimally solves the adaptive coding problem for actions and observations. In practice, it is implemented by sampling an environment parameter $\theta$ for each time step from the posterior distribution, and then treating it as if it was the true parameter, that is, issuing the action $a_t$ from the corresponding policy model $P(a_t \mid \theta, a_{<t}, o_{<t})$. This action-sampling method where beliefs are randomly instantiated was first proposed as a heuristic in (Thompson ) and is now known as Thompson sampling. Importantly, this method can be generalized and applied to solve general sequential adaptive decision-making problems.
So far Thompson sampling has been mainly applied to multi-armed bandit problems. Multi-armed bandits can be represented by a parameter θ that summarizes the statistical properties of the reward obtained for each lever. Reward distributions range from Bernoulli to Gaussian (with unknown mean and variance), and they can also depend on the particular context or state (Graepel et al. ; May and Leslie ; Granmo ; Scott ). In particular, the work of (May and Leslie ) and the work of (Granmo ) prove asymptotic convergence of Thompson sampling. The performance of bandit algorithms has also been studied in terms of the rate of growth of the regret (Lai and Robbins ), and recent bandit algorithms have been shown to match this lower bound (Cappé et al. ), including Thompson sampling algorithms for Bernoulli bandits (Kaufmann et al. ). Also, the work of (Chapelle and Li ) presents empirical results that show Thompson sampling is highly competitive, matching or outperforming popular methods such as UCB (Lai and Robbins ; Auer et al. ).
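For the Bernoulli bandit case mentioned above, Thompson sampling is commonly implemented with a Beta posterior per lever. The sketch below is a generic illustration of that textbook form and is not taken from any of the cited works; the uniform Beta(1, 1) prior, the function name `bernoulli_thompson` and the reward probabilities in the usage note are assumptions.

```python
import random

def bernoulli_thompson(true_means, steps, seed=0):
    """Run Beta-Bernoulli Thompson sampling and return pull counts per arm."""
    rng = random.Random(seed)
    n_arms = len(true_means)
    successes = [0] * n_arms
    failures = [0] * n_arms
    pulls = [0] * n_arms
    for _ in range(steps):
        # Sample one plausible mean per arm from its Beta posterior
        # (uniform Beta(1, 1) prior assumed).
        samples = [
            rng.betavariate(successes[i] + 1, failures[i] + 1)
            for i in range(n_arms)
        ]
        # Act optimally with respect to the sampled belief.
        arm = max(range(n_arms), key=lambda i: samples[i])
        reward = 1 if rng.random() < true_means[arm] else 0
        successes[arm] += reward
        failures[arm] += 1 - reward
        pulls[arm] += 1
    return pulls
```

For example, `bernoulli_thompson([0.2, 0.8], 2000)` should allocate the large majority of pulls to the second lever while still occasionally testing the first.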
Another class of problems to which Thompson sampling has been applied in the past is Markov decision processes (MDPs). MDPs can be represented by parameterizing the dynamics and reward distribution (model-based) (Strens ) or by directly parameterizing the Q-table (model-free) (Dearden et al. ; Ortega and Braun [2010a]). The first approach samples a full description of an MDP, solves it for the optimal policy, and then issues the optimal action. This is repeated in each time step. The second approach avoids the computational overhead of solving for the optimal policy in each time step by directly doing inference on the Q-tables. Actions are chosen by picking the one having the highest Q-value for the current state. The same ideas can also be applied to solve adaptive control problems with linear system equations, quadratic cost functions and Gaussian noise (Braun and Ortega ).
Computational complexity. The computations required to find the optimal solution (for instance, the computational complexity of solving the Bellman optimality equations) are prohibitive in general and scale exponentially with the length of the horizon. The problem is tractable only in very special cases under assumptions that reduce the effective size of the problem.
Causal precedence of policy choice. The choice of the policy has to be made before the interaction with the environment starts. That is, an agent has to have a unique optimal policy before it has even interacted once with the environment. An optimal policy constructed by the maximum expected utility principle is therefore a very risky bet, as a lot of resources have to be spent before any evidence exists that the underlying model or prior is adequate.
For these two reasons, it is often practically impossible to apply the maximum expected utility principle. This has led to the development of theories of bounded rational decision-making that take the information processing limitations of decision-makers into account. The modern study of bounded rationality was famously broached by Simon ([1956, 1972, 1984]) and has since been extensively investigated in psychology (Gigerenzer and Selten ; Camerer ), cognitive science (Howes et al. ; Janssen et al. ; Lewis et al. ), economics (Aumann ; Rubinstein ; Kahneman ), game theory (McKelvey and Palfrey [1995, 1998]; Wolpert ), political science (Jones ), industrial organization (Spiegler ), and computer science and artificial intelligence research (Lipman ; Russell ; Russell and Subramanian ). Different conceptions of bounded rationality are divided as to whether bounded rational behavior is thought to be fundamentally non-optimizing or whether it can be expressed as a (constrained) optimization problem and as to whether it involves any kind of meta-reasoning (Klein ). While the variational formulation in the free energy can also be thought of as a constrained optimization problem, this optimization is only implicit in an agent that runs an anytime algorithm to obtain samples that directly optimize the original (unconstrained) utility function. The average number of samples that can be afforded is determined by an inverse temperature parameter, such that the search for the optimum is aborted after some time, thereby generating some kind of satisficing solution. The free energy formulation of bounded rationality also allows reinterpreting a wider research program that has investigated relative entropy as a particular cost function for control (Kappen ; Todorov [2006, 2009]; Theodorou et al. ; Peters et al. ; Braun and Ortega ; Kappen et al. ) and has inspired the formulation of optimal control problems as inference problems (Tishby and Polani ; Kappen et al. ; Rawlik et al. ).
In Section “Decision-making with limited resources” we have argued that Thompson sampling can be regarded as an instantiation of free energy optimizing bounded rationality requiring the minimal number of samples of the latent variable θ in the decision-making process determining the next action. An agent that follows such a Thompson sampling strategy randomly samples beliefs θ and acts optimally with respect to these random beliefs. In contrast, a perfectly rational agent optimizes his utility over the entire belief tree.
Policy Uncertainty. Given a problem specification in terms of the predictive model and the utility function, we can think about policy uncertainty in terms of policy search methods. The task of a policy search method is to calculate a policy that approximates the optimal policy. More specifically, let π be a parameter in a set Π indexing the set of candidate policies P(a t |π,a1:t-1,o1:t-1) indexed by θ∈Θ. Then, in the most general case, a policy search method returns a probability distribution P(π) over Π representing the uncertainty over the optimal policy parameters. If the algorithm solves the maximum expected utility problem, then the support of this distribution will exclusively cover the set of optimal policies Π∗⊂Π. Otherwise there remains uncertainty over the optimal policy parameters. However, many policy search methods do not explicitly deal with the uncertainty over the policy parameters. Some methods only return a point estimate . For instance, reinforcement learning algorithms (Sutton and Barto ) start from a randomly initialized point estimate of the optimal policy and then generate refined point estimates in each time step t=1,2,3,… using the data provided by experience. In order to converge to the optimal policy, these algorithms have to deal with the exploration-exploitation trade-off. This means that the agents cannot just greedily act according to these point estimates; instead, they have to produce explorative actions as well, that is, actions that deviate from the current estimate of the optimal policy—for instance producing optimistic actions based on UCB (Lai and Robbins ; Auer et al. ).
Crucially, when sampling actions from the predictive distribution, the policy index π is identical to the index θ that identifies a particular environment with the likelihood model $P(o_t \mid a_{1:t-1}, o_{1:t-1})$. By thus turning the reinforcement learning problem into an inference problem, the exploration-exploitation trade-off becomes a bias-variance trade-off (Geman et al. ) in policy space. This highlights the essence of the exploration-exploitation trade-off: any action issued by the agent has to respect the uncertainty over the policy parameter; otherwise the actions are biased. In particular, if the agent acts deterministically and greedily (i.e. it treats the estimate as if it were the true policy parameter) then it is overfitting the experience and introducing a bias; likewise, an agent that follows a stochastic policy introduces variance and will not produce the highest possible reward compared to the case when the optimal policy is known. An excessively stochastic agent therefore underfits its experience.
The operational distinction of having policy uncertainty has important algorithmic consequences. When there is policy uncertainty, the belief of the decision-maker is itself a random variable. This means that the very policy is undefined until the random variable is resolved. Hence, the computation of the optimal policy can be delayed and determined dynamically. It is precisely this fact that is (implicitly) exploited in popular reinforcement learning algorithms, and explicitly in the algorithms based on random beliefs. This is in stark contrast to the case when there is no policy uncertainty, where the policy is pre-computed and static. Another example where random beliefs play a crucial role is in games with incomplete information (Osborne and Rubinstein ). Here, having incomplete information about the other player leads to an infinite hierarchy of meta-reasoning about the other player’s strategy. To avoid this difficulty, Harsanyi introduced Bayesian games (Harsanyi ). In a Bayesian game, incomplete knowledge is modeled by randomly instantiating the players’ types, after which they choose their strategies optimally, thus eliminating the need for recurrent reasoning about the other players’ strategies. Similarly, a Thompson sampling agent randomly instantiates his belief at every point in time and acts optimally with respect to this belief. An important consequence of this is that agents have uncertainty about their policy.
In the case of observations, this is a well-known variational principle for Bayesian inference, as it describes a predictor that requires, on average, the least amount of extra bits to capture informational surprise stemming from the behavior of the environment. In the case of actions, the same principle can be harnessed to describe resourceful generation of actions in a way that requires random bits with minimum length on average, when trying to match the optimal policy most suitable for the unknown environment (MacKay ). When thinking about the adaptive control problem in this way, the aim of the adaptive agent is simply to avoid surprise. The fact that each custom-built policy can be thought of as maximizing a utility in environment Q θ is not crucial, as this policy could also be given by a teacher’s demonstration in the absence of an explicitly stated utility function. The avoidance of surprise of adaptive systems has recently been discussed in the context of active inference and the free energy principle (Friston [2009, 2010]).
In Section “Causal induction”, we demonstrated that generalized Thompson sampling can also be applied to the problem of causal induction, by designing policy and prediction models with different causal structures. In this way, generalized Thompson sampling can be used as a general method for causal induction that is Bayesian in nature. It is based on the idea of combining probability trees (Shafer [1996]) with interventions (Pearl [2000]) for predicting the behavior of a manipulated system with multiple causal hypotheses. Both the interventions and the constraints on the causal hypotheses introduce statistical asymmetries that permit the extraction of causal information. Unlike frameworks that aim to extract causal information from observational data alone (Shimizu et al. [2006]; Griffiths and Tenenbaum [2009]; Janzing and Schölkopf [2010]), the proposed method is designed for agents that interact with their environment and use these interactions to discover causal relationships.
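The asymmetry introduced by interventions can be made concrete with a small sketch. It is a simplified stand-in for the probability-tree construction, not the authors' model: the two hypotheses "X->Y" and "Y->X", the parameters `P_ROOT` and `P_COPY`, and all function names are assumptions chosen so that both hypotheses imply the same observational joint distribution. Observing (x, y) pairs then leaves the posterior unchanged, whereas setting X by intervention and observing Y separates the hypotheses.

```python
import random

P_ROOT = 0.5   # hypothetical marginal probability of the cause being 1
P_COPY = 0.9   # hypothetical probability that the effect copies the cause
HYPOTHESES = ("X->Y", "Y->X")


def joint_likelihood(x, y):
    """Observational likelihood of (x, y); identical under both hypotheses."""
    return P_ROOT * (P_COPY if x == y else 1.0 - P_COPY)


def interventional_likelihood(hypothesis, x, y):
    """Likelihood of observing y after X has been set to x by intervention."""
    if hypothesis == "X->Y":
        return P_COPY if y == x else 1.0 - P_COPY
    return P_ROOT  # under "Y->X", setting X does not influence Y


def update(posterior, likelihoods):
    """Standard Bayes update over the causal hypotheses."""
    unnorm = {h: posterior[h] * likelihoods[h] for h in posterior}
    total = sum(unnorm.values())
    return {h: w / total for h, w in unnorm.items()}


def run_interventions(true_hypothesis, steps, seed=0):
    """Repeatedly set X at random, observe Y, and update the posterior."""
    rng = random.Random(seed)
    posterior = {h: 0.5 for h in HYPOTHESES}
    for _ in range(steps):
        x = rng.randrange(2)  # intervention: X is set, not observed
        if true_hypothesis == "X->Y":
            y = x if rng.random() < P_COPY else 1 - x
        else:
            y = 1 if rng.random() < P_ROOT else 0
        likelihoods = {h: interventional_likelihood(h, x, y) for h in HYPOTHESES}
        posterior = update(posterior, likelihoods)
    return posterior


if __name__ == "__main__":
    print(run_interventions("X->Y", steps=200))
    print(run_interventions("Y->X", steps=200))
```

Note that the probability of the intervened value x never enters the likelihood; only the response of Y is used as evidence.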
i.e. the actions have to be issued deterministically (but possibly history-dependent) from a unique policy. Intuitively speaking, this is because the operations of intervening and conditioning coincide when the random variables are deterministic.
There are important cases where random-belief approaches can fail. Indeed, it is easy to devise experiments where an agent with policy uncertainty converges exponentially more slowly than the Bayes-adaptive optimal policy, or does not converge at all. Consider, for example, two k-order Markov chains in which the only observable state is reached by applying the same action k times in a row, but we do not know which action it is. For two possible actions and a uniform prior over the two possible environments, the distribution over possible worlds stays uniform as long as no reward has been observed. Choosing actions randomly according to this distribution would require on the order of 2^k actions to accidentally choose a sequence of the same action of length k. Thus, the Bayes-adaptive optimal policy converges in time k, while the agent with policy uncertainty needs exponentially longer. A simple way to remedy this problem is, of course, to sample random beliefs only every k time steps (Strens [2000]). But this problem can be exacerbated in non-stationary environments. Take, for instance, an increasing MDP with two actions and number of states , in which the optimal policy converges in 100 steps, while an agent with policy uncertainty would not converge at all in most realizations. Although Ortega and Braun ([2010b]) prove asymptotic convergence for general environments fulfilling a restrictive form of ergodicity condition, this condition needs to be weakened for the convergence proof to be applicable to most real problems. But it is clear that a form of ergodicity is required for an agent with policy uncertainty to be able to learn to act optimally. Intuitively, this means that an agent can only learn if the environment has temporally stable statistical properties.
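The exponential gap in the k-step example can be quantified with a short sketch. It assumes a simplified reading of the example in which the agent picks one of two actions uniformly at random at every step until the rewarded action has occurred k times in a row; the function names and the systematic baseline are hypothetical. Under this assumption the expected waiting time satisfies E_k = 2 E_{k-1} + 2, which gives E_k = 2^(k+1) - 2.

```python
import random


def expected_steps_random(k, p=0.5):
    """Expected number of random actions until k consecutive rewarded actions.

    Uses the recursion E_k = (E_{k-1} + 1) / p with E_0 = 0, where p is the
    probability of picking the rewarded action at each step.
    """
    expected = 0.0
    for _ in range(k):
        expected = (expected + 1.0) / p
    return expected


def simulate_random(k, rng):
    """Simulate one run of uniformly random action selection."""
    run_length = 0
    steps = 0
    while run_length < k:
        steps += 1
        if rng.random() < 0.5:
            run_length += 1   # picked the rewarded action again
        else:
            run_length = 0    # streak broken, start over
    return steps


def systematic_worst_case(k):
    """Assumed baseline: commit to one action for k steps, then the other."""
    return 2 * k


if __name__ == "__main__":
    rng = random.Random(0)
    for k in (2, 4, 8):
        trials = [simulate_random(k, rng) for _ in range(1000)]
        mean = sum(trials) / len(trials)
        print(k, expected_steps_random(k), round(mean, 1), systematic_worst_case(k))
```

Holding a sampled belief fixed for k steps, as in the remedy mentioned above, corresponds to the systematic baseline here and removes the exponential dependence on k in this toy setting.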
In this paper we have argued that Thompson sampling is a bounded rational strategy in decision-making that can be considered optimal under given information processing constraints. Thompson sampling agents have uncertainty over their policy, which is a natural phenomenon that arises whenever there are not enough computational resources to apply the maximum expected utility principle to single out a unique optimal policy. Having policy uncertainty effectively weakens the two assumptions of the maximum expected utility principle: the optimal policy can be chosen and refined during interactions, and the computational complexity is lower. We have shown that treating this uncertainty in a Bayesian way, with actions as random variables that obey causal calculus, naturally leads to Thompson sampling and its Bayesian generalization. This generalized Thompson sampling can be straightforwardly applied to the problem of causal induction. Maintaining and updating Bayesian probabilities is an optimally efficient way to deal with uncertainty, be it with respect to the policy or the environment (Ortega and Braun [2010a]). As these random-belief approaches can be derived simply from probability theory and causal calculus, we suggest that they can be regarded not only as heuristic approximations to optimal decision-making, but also as principled solution methods in their own right.
a Each custom-built policy P(a_t | θ, a_<t, o_<t) can be thought of as maximizing a utility function in its environment θ, but this is not essential: the policy could also simply be given by a teacher’s demonstration, as in imitation learning (Schaal [1999]).
This study was supported by the DFG, Emmy Noether grant BR4164/1-1.
- Agrawal S, Goyal N: Analysis of Thompson sampling for the multi-armed bandit problem. JMLR: Workshop and Conference Proceedings vol 23 (2012) 39.1–39.26. 25th Annual Conference on Learning Theory 2011.
- Asmuth J, Li L, Littman ML, Nouri A, Wingate D: A Bayesian sampling approach to exploration in reinforcement learning. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence. UAI ’09, Arlington, Virginia, United States: AUAI Press; 2009:19–26.
- Auer P, Cesa-Bianchi N, Fischer P: Finite-time analysis of the multiarmed bandit problem. Machine Learning 2002, 47: 235–256. 10.1023/A:1013689704352
- Aumann RJ: Rationality and bounded rationality. Games and Econ Behavior 1997, 21(1–2):2–14. 10.1006/game.1997.0585
- Box G: Use and abuse of regression. Technometrics 1966, 8(4):625–629. 10.2307/1266635
- Braun DA, Ortega PA: A minimum relative entropy principle for adaptive control in linear quadratic regulators. The 7th Conference on Informatics in Control, Automation and Robotics, Volume 3 2010, 103–108.
- Braun DA, Ortega PA, Theodorou E, Schaal S: Path integral control and bounded rationality. IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning 2011, 202–209.
- Bubeck S, Liu CY: A note on the Bayesian regret of Thompson sampling with an arbitrary prior. 2013. arXiv:1304.5758
- Camerer C: Behavioral Game Theory: Experiments in Strategic Interaction. Princeton: Princeton University Press; 2003.
- Cao F, Ray S: Bayesian hierarchical reinforcement learning. Neural Information Processing Systems 25 (NIPS) 2012.
- Cappé O, Garivier A, Maillard OA, Munos R, Stoltz G: Kullback-Leibler upper confidence bounds for optimal sequential allocation. Ann Stat 2013, 41(3):1516–1541. 10.1214/13-AOS1119
- Chapelle O, Li L: An empirical evaluation of Thompson sampling. NIPS 2011, 2249–2257.
- Dearden R, Friedman N, Russell S: Bayesian Q-learning. In AAAI ’98/IAAI ’98: Proceedings of the fifteenth national/tenth conference on Artificial intelligence/Innovative applications of artificial intelligence. Menlo Park, CA, US: American Association for Artificial Intelligence; 1998:761–768.
- Dimitrakakis C: Monte-Carlo utility estimates for Bayesian reinforcement learning. IEEE Conference on Decision and Control 2013.
- Dimitrakakis C, Tziortziotis N: ABC reinforcement learning. Proceedings of The 30th International Conference on Machine Learning 2013, 684–692.
- Duff M: Optimal learning: computational procedures for Bayes-adaptive Markov decision processes. PhD thesis. 2002. [Director: Andrew Barto]
- Friston K: The free-energy principle: a rough guide to the brain? Trends in Cognitive Science 2009, 13: 293–301. 10.1016/j.tics.2009.04.005
- Friston K: The free-energy principle: a unified brain theory? Nat Rev Neurosci 2010, 11: 127–138. 10.1038/nrn2787
- Geman S, Bienenstock E, Doursat R: Neural networks and the bias/variance dilemma. Neural Comput 1992, 4: 1–58. 10.1162/neco.1992.4.1.1
- Gigerenzer G, Selten R: Bounded Rationality: The Adaptive Toolbox. Cambridge, MA: MIT Press; 2001.
- Gittins J: Bandit processes and dynamic allocation indices. J R Stat Soc Ser B, Methodological 1979, 41: 148–177.
- Glymour C, Spirtes P, Scheines R: Causation, Prediction, and Search, 2nd edition. Cambridge, Massachusetts, USA: MIT Press; 2000.
- Graepel T, Quiñonero Candela J, Borchert T, Herbrich R: Web-scale Bayesian click-through rate prediction for sponsored search advertising in Microsoft’s Bing search engine. Proceedings of the Twenty-Seventh International Conference on Machine Learning 2010, 25–26.
- Granmo OC: A Bayesian learning automaton for solving two-armed bernoulli bandit problems. Proceedings of the 2008 Seventh International Conference on Machine Learning and Applications: ICMLA ’08 2008, 23–30.
- Granmo OC: Solving two-armed Bernoulli bandit problems using a Bayesian learning automaton. Int J Intell Comput Cybernetics 2010, 3(2):207–234. 10.1108/17563781011049179
- Granmo OC, Glimsdal S: Accelerated Bayesian learning for decentralized two-armed bandit based decision making with applications to the Goore game. Applied intelligence 2013, 38(4):479–488. 10.1007/s10489-012-0346-z
- Griffiths TL, Tenenbaum JB: Theory-based causal induction. Psychological Rev 2009, 116: 661–716.
- Harsanyi J: Games with incomplete information played by “Bayesian” players. Management Sci 1967, 14(3):159–182. 10.1287/mnsc.14.3.159
- Heckerman D, Meek C, Cooper G: A Bayesian approach to causal discovery. Computation, causation, and discovery 1999, 19: 141–166.
- Howes A, Lewis RL, Vera A: Rational adaptation under task and processing constraints: implications for testing theories of cognition and action. Psychological Rev 2009, 116(4):717–751.
- Hutter M: Universal Artificial Intelligence: Sequential Decisions based on Algorithmic Probability. Berlin: Springer; 2004.
- Janssen CP, Brumby DP, Dowell J, Chater N, Howes A: Identifying optimum performance trade-offs using a cognitively bounded rational analysis model of discretionary task interleaving. Topics in Cognitive Sci 2011, 3: 123–139. 10.1111/j.1756-8765.2010.01125.x
- Janzing D, Schölkopf B: Causal inference using the algorithmic Markov condition. IEEE Trans Inf Theor 2010, 56(10):5168–5194.
- Jaynes E: Entropy and search theory. In Maximum entropy and Bayesian methods in inverse problems. Heidelberg: Springer-Verlag; 1985.
- Jones BD: Bounded rationality and political science: lessons from public administration and public policy. J Public Administration Res Theory 2003, 13(4):395–412. 10.1093/jopart/mug028
- Kahneman D: Maps of bounded rationality: psychology for behavioral economics. Am Econ Rev 2003, 93(5):1449–1475. 10.1257/000282803322655392
- Kappen H: A linear theory for control of non-linear stochastic systems. Phys Rev Lett 2005, 95: 200201.
- Kappen H, Gómez V, Opper M: Optimal control as a graphical model inference problem. Machine Learn 2012, 1: 1–11.
- Kaufmann E, Korda N, Munos R: Thompson sampling: an asymptotically optimal finite-time analysis. In ALT, Volume 7568 of Lecture Notes in Computer Science. Edited by: Bshouty NH, Stoltz G, Vayatis N, Zeugmann T. Heidelberg, Germany: Springer; 2012:199–213.
- Keller G: Equilibrium States in Ergodic Theory. London Mathematical Society Student Texts: Cambridge University Press; 1998.
- Klein G: The fiction of optimization. In Bounded rationality: The adaptive toolbox. Edited by: Gigerenzer G, Selten R. Cambridge, Massachusetts, USA: MIT Press; 2001.
- Korda N, Kaufmann E, Munos R: Thompson sampling for 1-dimensional exponential family bandits. Advances in Neural Information Processing Systems 2013, 1448–1456.
- Lai T, Robbins H: Asymptotically efficient adaptive allocation rules. Adv Appl Math 1985, 6: 4–22.
- Legg S: Machine super intelligence. PhD thesis, Department of Informatics, University of Lugano 2008.
- Lewis R, Howes A, Singh S: Computational rationality: linking mechanism and behavior through bounded utility maximization. Topics in Cognitive Science 2014, (in press)
- Lipman B: Information processing and bounded rationality: a survey. Canadian J Econ 1995, 28: 42–67. 10.2307/136022
- MacKay D: Information Theory, Inference, and Learning Algorithms. Cambridge, UK: Cambridge University Press; 2003.
- Martin J: Bayesian Decision Problems and Markov Chains. Publications in Operations Research, Wiley; 1967.
- May B, Leslie D: Simulation studies in optimistic Bayesian sampling in contextual-bandit problems. In Technical Report 11:02. Statistics Group, Department of Mathematics. Bristol, UK: University of Bristol; 2011.
- May BC, Korda N, Lee A, Leslie DS: Optimistic Bayesian sampling in contextual-bandit problems. J Mach Learn Res 2012, 13: 2069–2106.
- McKelvey R, Palfrey TR: Quantal response equilibria for extensive form games. Experimental Econ 1998, 1: 9–41.
- McKelvey RD, Palfrey TR: Quantal response equilibria for normal form games. Games and Econ Behavior 1995, 10: 6–38. 10.1006/game.1995.1023
- Mellor J, Shapiro J: Thompson sampling in switching environments with Bayesian online change point detection. 2013. arXiv:1302.3721
- Ortega PA: A unified framework for resource-bounded autonomous agents interacting with unknown environments. PhD thesis, Department of Engineering, University of Cambridge, UK 2011a.
- Ortega PA: Bayesian causal induction. NIPS Workshop on Philosophy and Machine Learning, Granada 2011.
- Ortega PA, Braun DA: A Bayesian rule for adaptive control based on causal interventions. In Proceedings of the Third Conference on Artificial General Intelligence. Paris, France: Atlantis Press; 2010a.
- Ortega PA, Braun DA: A minimum relative entropy principle for learning and acting. J Artif Intell Res 2010b, 38: 475–511.
- Ortega PA, Braun DA: A conversion between utility and information. In Proceedings of the Third Conference on Artificial General Intelligence. Paris, France: Atlantis Press; 2010c:115–120.
- Ortega PA, Braun DA: Information, utility and bounded rationality. In Lecture notes on artificial intelligence, Volume 6830. Heidelberg, Germany: Springer-Verlag; 2011:269–274.
- Ortega PA, Braun DA: Free energy and the generalized optimality equations for sequential decision making. In European Workshop for Reinforcement Learning. Edinburgh, UK; 2012a.
- Ortega PA, Braun DA: Adaptive coding of actions and observations. NIPS Workshop on Information in Perception and Action 2012b.
- Ortega PA, Braun DA: Thermodynamics as a theory of decision-making with information-processing costs. Proc R Soc A: Mathematical, Physical and Engineering Science 2013, 469: 2153.
- Osband I, Russo D, Van Roy B: (More) efficient reinforcement learning via posterior sampling. Advances in Neural Information Processing Systems 2013, 3003–3011.
- Osborne MJ, Rubinstein A: A Course in Game Theory. Cambridge, Massachusetts, USA: MIT Press; 1999.
- Pearl J: Causality: Models, Reasoning, and Inference. Cambridge, UK: Cambridge University Press; 2000.
- Peters J, Mülling K, Altun Y: Relative entropy policy search. AAAI 2010.
- Rawlik K, Toussaint M, Vijayakumar S: On stochastic optimal control and reinforcement learning by approximate inference. In Proceedings of Robotics: Science and Systems. Sydney, Australia; 2012.
- Rubinstein A: Modeling Bounded Rationality. Cambridge, Massachusetts, USA: MIT Press; 1998.
- Russell S: Rationality and Intelligence. In Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence. Edited by: Mellish C. Englewood Cliffs, New Jersey, USA: Prentice-Hall; 1995:950–957.
- Russell S, Norvig P: Artificial Intelligence: A Modern Approach, 1st edition. Prentice-Hall: Englewood Cliffs, NJ; 1995.
- Russell S, Subramanian D: Provably bounded-optimal agents. J Artif Intell Res 1995, 3: 575–609.
- Russo D, Van Roy B: Learning to optimize via posterior sampling. 2013. arXiv:1301.2609
- Schaal S: Is imitation learning the route to humanoid robots? Trends in cognitive sciences 1999, 3(6):233–242. 10.1016/S1364-6613(99)01327-3
- Scott S: A modern Bayesian look at the multi-armed bandit. Applied Stochastic Models in Business and Industry 2010, 26: 639–658. 10.1002/asmb.874
- Shafer G: The Art of Causal Conjecture. Cambridge, Massachusetts, USA: MIT Press; 1996.
- Shimizu S, Hoyer PO, Hyvärinen A, Kerminen A: A Linear Non-Gaussian Acyclic Model for Causal Discovery. J Mach Learn Res 2006, 7: 2003–2030.
- Simon HA: Rational choice and the structure of the environment. Psychological Rev 1956, 63(2):129–138.
- Simon HA: Theories of bounded rationality. In Decision and Organization. Edited by: McGuire CB, Radner R. Amsterdam: North-Holland Publishing; 1972:161–176.
- Simon HA: Models of Bounded Rationality. Cambridge, Massachusetts, USA: MIT Press; 1984.
- Spiegler R: Bounded Rationality and Industrial Organization. Oxford: Oxford University Press; 2011.
- Stone L: Theory of Optimal Search. New York: Academic Press; 1998.
- Strens M: A Bayesian framework for reinforcement learning. Proceedings of the Seventeenth International Conference on Machine Learning 2000.
- Sutton R, Barto A: Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press; 1998.
- Theodorou E, Buchli J, Schaal S: A generalized path integral approach to reinforcement learning. J Mach Learn Res 2010, 11: 3137–3181.
- Thompson WR: On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 1933, 25(3/4):285–294. 10.2307/2332286
- Tishby N, Polani D: Information theory of decisions and actions. In Perception-reason-action cycle: Models, algorithms and systems. Edited by: Cutsuridis V, Hussain A, Taylor JG. Heidelberg: Springer-Verlag; 2011:601–636.
- Todorov E: Linearly solvable Markov decision problems. Advances in Neural Information Processing Systems, Volume 19 2006, 1369–1376.
- Todorov E: Efficient computation of optimal actions. Proceedings of the National Academy of Sciences USA 2009, 106: 11478–11483. 10.1073/pnas.0710743106
- Tziortziotis N, Dimitrakakis C, Blekas K: Cover tree Bayesian reinforcement learning. 2013a. arXiv:1305.1809
- Tziortziotis N, Dimitrakakis C, Blekas K: Linear Bayesian reinforcement learning. In Proceedings of the Twenty-Third international joint conference on Artificial Intelligence. AAAI Press; 2013:1721–1728.
- Vitanyi P: Time, space, and energy in reversible computing. Proceedings of the 2nd ACM conference on Computing frontiers 2005, 435–444.
- Wolpert DH: Information theory - the bridge connecting bounded rational game theory and statistical physics. In Complex Engineering Systems. New York, USA: Perseus Books; 2004.
- Wyatt J: Exploration and inference in learning from reinforcement. PhD thesis, Department of Artificial Intelligence, University of Edinburgh 1997.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited.