Generalized Thompson Sampling for Sequential Decision-Making and Causal Inference

Recently, it has been shown how sampling actions from the predictive distribution over the optimal action (sometimes called Thompson sampling) can be applied to solve sequential adaptive control problems when the optimal policy is known for each possible environment. The predictive distribution can then be constructed as a Bayesian superposition of the optimal policies, weighted by their posterior probabilities, which are updated by Bayesian inference and causal calculus. Here we discuss three important features of this approach. First, we discuss to what extent such Thompson sampling can be regarded as a natural consequence of the Bayesian modeling of policy uncertainty. Second, we show how Thompson sampling can be used to study interactions between multiple adaptive agents, thus opening up an avenue of game-theoretic analysis. Third, we show how Thompson sampling can be applied to infer causal relationships when interacting with an environment in a sequential fashion. In summary, our results suggest that Thompson sampling might not merely be a useful heuristic, but a principled method to address problems of adaptive sequential decision-making and causal inference.


Introduction
In a research paper from 1933, Thompson studied the problem of finding out which one of two drugs was better when testing them on a patient population under the constraint that as few people as possible should be subjected to the inferior drug during the course of testing [1]. Given a current probability estimate P of one treatment being better than the other, he suggested that it might be a good idea to adjust the proportions of future test subjects that take the two drugs to the respective probabilities P and 1 − P . This way one would not run into the danger of permanently cutting off all future test subjects from a potentially superior treatment that so far seems inferior due to statistical fluctuations, while only temporarily risking exposure to a potentially inferior drug for a smaller proportion of the population. Randomizing actions based on the probability that this action is believed to be optimal when faced with an unknown environment is now sometimes called Thompson sampling.
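Thompson's allocation rule can be illustrated with a small sketch. It assumes binary treatment outcomes, independent Beta posteriors with a uniform prior, and a Monte Carlo estimate of P; the function names and the patient counts are hypothetical and not taken from [1].

```python
import random

def prob_first_better(s1, f1, s2, f2, n_draws=20000, rng=None):
    """Monte Carlo estimate of P(drug 1 is better) under independent
    Beta(1 + successes, 1 + failures) posteriors (uniform prior assumed)."""
    rng = rng or random.Random(0)
    wins = 0
    for _ in range(n_draws):
        p1 = rng.betavariate(1 + s1, 1 + f1)
        p2 = rng.betavariate(1 + s2, 1 + f2)
        wins += p1 > p2
    return wins / n_draws

def allocate(n_patients, p):
    """Split the next n_patients in proportions p and 1 - p."""
    n1 = round(n_patients * p)
    return n1, n_patients - n1

# Hypothetical trial so far: drug 1 helped 7 of 10 patients, drug 2 helped 4 of 10.
p = prob_first_better(s1=7, f1=3, s2=4, f2=6)
print(p, allocate(100, p))
```

Because neither proportion is ever exactly zero, the apparently inferior drug keeps receiving some test subjects, which is the property the paragraph above emphasizes.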
Thompson sampling is a form of probability matching. Probability matching has been extensively studied in both humans and animals when they make predictions in stochastic environments [2,3]. Rather than consistently predicting the most likely outcome, experimental subjects tend to randomize their predictions based on the probabilities with which the respective events occur. When the probabilities are known, this is clearly a suboptimal strategy. However, in the case of Thompson sampling it is important to note that the probabilities are not known. Nevertheless, one might argue that Thompson sampling is a suboptimal strategy, as Thompson's problem can be thought of as a bandit problem [4], which is solved optimally by Gittins indices in the case of known prior probabilities and discounted rewards [5]. Most studies have therefore examined Thompson sampling as a heuristic in the context of bandit problems [6,7,8,9,10,11,12,13,14,15,16].
Recently, however, it was shown that Thompson sampling can also be applied to solve a more general class of sequential adaptive control problems, provided that both an optimal policy and a predictive model are known for each possible environment [17]. When an environment is drawn randomly from the set of possible environments, the optimal policy can then be inferred on the fly by an adaptation process that is driven by actions sampled from the predictive distribution over the optimal policies. Here we study three characteristic features of such generalized Thompson sampling. First, we discuss to what extent Thompson sampling can be regarded as a natural consequence of a Bayesian treatment of policy uncertainty. Second, we study convergence behavior when two adaptive Thompson sampling agents are coupled in a sequential fashion. Third, we show how this approach can be extended naturally to address the problem of causal induction when interacting with an unknown environment.
The paper is structured as follows. In Section 2 we clarify the problem statement and recapitulate the main result of [17]. In Section 3 we analyze the uncertainty faced by agents that are unable to compute the single best policy. In Section 4 we study interactions that arise when coupling two adaptive agents that employ Thompson sampling to determine their actions. In Section 5 we investigate how this approach can be applied to adaptive agents that need to discover the causal structure of their environment. Finally, we discuss in what sense Thompson sampling might provide a principled solution to adaptive decision-making problems.

Preliminaries
We restrict the exposition to the case of discrete time with discrete stochastic observations and actions. Let $O$ and $A$ be two finite sets, the first being the set of observations and the second being the set of actions. We use $a_{\le t} := a_1 a_2 \ldots a_t$ and $a_{<t} := a_1 a_2 \ldots a_{t-1}$ to simplify the notation of strings. We define the set of interactions as $Z := A \times O$. The set of interaction strings of length $t \ge 0$ is denoted by $Z^t$. The set of all finite interaction strings is $Z^* := \bigcup_{t \ge 0} Z^t$, and the set of infinite strings is $Z^\infty := \{w : w = a_1 o_1 a_2 o_2 \ldots\}$. The interaction string of length 0 is denoted by $\epsilon$.
Agents and environments are formalized as I/O systems. An I/O system $\Pr$ is a probability distribution over interaction sequences $Z^\infty$. $\Pr$ is uniquely determined by the conditional probabilities $\Pr(a_t | a_{<t}, o_{<t})$ and $\Pr(o_t | a_{\le t}, o_{<t})$ for each $a_1 o_1 \ldots a_{t-1} o_{t-1} a_t \in Z^*$. An interaction system $(P, Q)$ is a coupling of two I/O systems, where $P$ is an agent and $Q$ is an environment. Because the agent and the environment mutually influence each other, their actions and observations are conditioned on the previous interactions. Accordingly, the probability of an interaction string $a_1 o_1 \ldots a_T o_T$ is given by
$$\Pr(a_{\le T}, o_{\le T}) = \prod_{t=1}^{T} P(a_t | a_{<t}, o_{<t}) \, Q(o_t | a_{\le t}, o_{<t}). \tag{1}$$
From the point of view of the agent $P$, the distribution $P(a_t | a_{<t}, o_{<t})$ is a policy and captures the probability of producing action $a_t \in A$ given the history $a_1 o_1 \ldots a_{t-1} o_{t-1}$. The distribution $P(o_t | a_{\le t}, o_{<t})$ is the agent's predictive model of the environment, as it predicts the probability of the observation $o_t \in O$ given the history $a_1 o_1 \ldots a_{t-1} o_{t-1} a_t$. For the agent $P$, the sequence $o_1 o_2 \ldots$ provides its input stream and the sequence $a_1 a_2 \ldots$ is its output stream. In the case of the environment $Q$ the roles are reversed, that is, the sequence $o_1 o_2 \ldots$ is its output stream and the sequence $a_1 a_2 \ldots$ provides its input stream. The quintessential goal is to choose the agent's policy such that the resulting distribution over the interaction sequences (1) is desirable.
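The coupling of an agent and an environment can be sketched in a few lines. The snippet below multiplies policy and environment conditionals along an interaction string; the callables `uniform_policy` and `echo_environment` are hypothetical toy systems introduced only for illustration.

```python
def interaction_probability(actions, observations, policy, environment):
    """Probability of a_1 o_1 ... a_T o_T as the product over t of
    policy(a_t | history) * environment(o_t | history, a_t)."""
    prob = 1.0
    history = []  # list of (action, observation) pairs seen so far
    for a, o in zip(actions, observations):
        prob *= policy(a, history) * environment(o, history, a)
        history.append((a, o))
    return prob

# Hypothetical toy system: a uniform policy over two actions and an
# environment that echoes the action with probability 0.9.
def uniform_policy(a, history):
    return 0.5

def echo_environment(o, history, a):
    return 0.9 if o == a else 0.1

print(interaction_probability([0, 1], [0, 0], uniform_policy, echo_environment))
```

Here the string a=0, o=0, a=1, o=0 has probability (0.5 · 0.9) · (0.5 · 0.1) = 0.0225.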

Policy Construction: Known Environment
If $Q$ is known, then $P$ can be equipped with a model that can perfectly predict its environment, that is, $P(o_t | a_{\le t}, o_{<t}) = Q(o_t | a_{\le t}, o_{<t})$ for all $a_1 o_1 \ldots a_t \in Z^*$. Moreover, a custom-made policy can be designed for $P$ that produces desirable interaction sequences. Desirability is typically formalized by the economic theory of subjective expected utility (SEU) [18,19], which stipulates that a decision maker's preferences over lotteries can be thought of as maximizing a SEU of the outcome. In the policy construction setting, this translates into the designer having a real-valued utility function giving rise to utilities $U(a_{\le T}, o_{\le T})$ for each realization, and the predictive model $P(o_t | a_{\le t}, o_{<t})$. The utility function quantifies the subjective desirability of a particular interaction string and the probabilities represent the subjective model of the environment. The maximum expected utility principle then states that the designer has to choose the policy such that it maximizes the expected utility
$$E[U] = \sum_{a_{\le T}, o_{\le T}} P(a_{\le T}, o_{\le T}) \, U(a_{\le T}, o_{\le T}),$$
where the probabilities $P(a_{\le T}, o_{\le T})$ are the policy-prediction products $\prod_{t=1}^{T} P(a_t | a_{<t}, o_{<t}) \, P(o_t | a_{\le t}, o_{<t})$. The optimal policy is often computed by restating the problem recursively and then using dynamic programming to solve the Bellman optimality equations [20]. The policy and the prediction model are both subjective in the sense that they are unilaterally chosen by the designer. A policy choice explainable by this scheme is defined to be a rational choice. Choices that do not strictly obey the maximum SEU principle are irrational, or at best bounded rational [21].
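As a rough illustration of the expected-utility computation, the following sketch evaluates a policy by brute-force enumeration over all interaction strings. The toy model, the utility function and both candidate policies are assumptions made for this example; enumeration is used here instead of dynamic programming purely for brevity.

```python
from itertools import product

def expected_utility(policy, model, utility, actions, observations, horizon):
    """Sum of P(a_{<=T}, o_{<=T}) * U(a_{<=T}, o_{<=T}) over all |A x O|^T strings."""
    total = 0.0
    for string in product(product(actions, observations), repeat=horizon):
        prob, history = 1.0, []
        for a, o in string:
            prob *= policy(a, history) * model(o, history, a)
            history.append((a, o))
        total += prob * utility(string)
    return total

# Hypothetical toy problem: action 1 yields observation 1 with probability 0.8,
# action 0 with probability 0.2; utility counts the observations equal to 1.
def model(o, history, a):
    p_one = 0.8 if a == 1 else 0.2
    return p_one if o == 1 else 1.0 - p_one

def utility(string):
    return sum(o for _, o in string)

def always_one(a, history):
    return 1.0 if a == 1 else 0.0

def uniform(a, history):
    return 0.5

print(expected_utility(always_one, model, utility, (0, 1), (0, 1), horizon=2))
print(expected_utility(uniform, model, utility, (0, 1), (0, 1), horizon=2))
```

The loop visits |A × O|^T strings, so this direct evaluation is only feasible for very short horizons.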

Policy Construction: Unknown Environment
In general, the prediction model will not be equal to the generative law of the environment, that is, $P(o_t | a_{\le t}, o_{<t}) \neq Q(o_t | a_{\le t}, o_{<t})$, and consequently the true expected utility is in general not equal to the SEU:
$$\sum_{a_{\le T}, o_{\le T}} \prod_{t=1}^{T} P(a_t | a_{<t}, o_{<t}) \, Q(o_t | a_{\le t}, o_{<t}) \, U(a_{\le T}, o_{\le T}) \neq \sum_{a_{\le T}, o_{\le T}} \prod_{t=1}^{T} P(a_t | a_{<t}, o_{<t}) \, P(o_t | a_{\le t}, o_{<t}) \, U(a_{\le T}, o_{\le T}).$$
One of the most interesting cases where the prediction model and the true generative law of the environment do not match is when the designer is uncertain about the latter. Formally, the designer expresses his uncertainty by introducing a random variable $\theta$ that indexes the class of potential environments $Q_\theta$. More specifically, he has a class of prediction models and policies such that for every possible environment indexed by $\theta$ there is a perfectly fitting predictor and a desirable custom-built policy,
$$B(o_t | \theta, a_{\le t}, o_{<t}) = Q_\theta(o_t | a_{\le t}, o_{<t}) \quad \text{and} \quad B(a_t | \theta, a_{<t}, o_{<t}). \tag{4}$$
Moreover, the designer believes that $Q_\theta$ is drawn with probability $B(\theta)$ from a set $\Theta$ of possible environments before the interaction starts, where $\Theta$ is assumed to be discrete for simplicity.

Decision-theoretic Problem Formulation
In order to stay within the framework of subjective expected utility, one has to reduce the problem of the unknown environment to a problem with a known environment. Such a "known" environment can be created from a set of possible environments as a new "super-environment" by marginalizing over the parameter of the possible environments, thus obtaining the Bayesian mixture distribution [22]
$$P(o_t | a_{\le t}, o_{<t}) = \sum_{\theta \in \Theta} B(o_t | \theta, a_{\le t}, o_{<t}) \, B(\theta | a_{\le t}, o_{<t}),$$
where $B(\theta | a_{\le t}, o_{<t})$ is the posterior over the environments given the past. The adaptive control problem is then solved by equating the prediction model $P$ with the Bayesian predictive distribution over the observations, and then choosing the policy that maximizes the SEU as in the case of a known environment. This procedure effectively enlarges the space of possible environments to the convex hull $H(\Theta)$ spanned by the prediction models in $\Theta$, i.e. an "environment" is any convex combination of distributions indexed by $\theta$. Each $\theta \in H(\Theta)$ thus corresponds to a Bayesian mixture over the environments in $\Theta$, but with a different prior. If the true generative law coincides with one of the models, i.e. there is a $\theta^* \in \Theta$ such that
$$B(o_t | \theta^*, a_{\le t}, o_{<t}) = Q(o_t | a_{\le t}, o_{<t}), \tag{5}$$
then the predictive model will converge to the true generative law with $\Pr$-probability one, that is,
$$P(o_t | a_{\le t}, o_{<t}) - Q(o_t | a_{\le t}, o_{<t}) \to 0$$
as $t \to \infty$. Since the policy choice is optimal, this scheme directly bypasses the exploration-exploitation dilemma [23]. The important point, however, is that in order to express the uncertainty over the environment's probability law, the designer had to introduce a belief model that compiles into an actual prediction model. Thus, the policy construction in the SEU problem statement for unknown environments proceeds in two steps: first, a Bayesian mixture environment is created, and second, a utility-maximizing policy is found on this mixture environment.
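The first of the two steps, forming the Bayesian mixture predictor, can be sketched as follows. The class of two Bernoulli environments and the observation sequence are hypothetical, and actions are ignored to keep the example short.

```python
def update_posterior(posterior, likelihoods):
    """One Bayes step: posterior[theta] * likelihood[theta], renormalized."""
    unnorm = {th: posterior[th] * likelihoods[th] for th in posterior}
    z = sum(unnorm.values())
    return {th: w / z for th, w in unnorm.items()}

def mixture_prediction(posterior, predictors, o):
    """Bayesian mixture: sum over theta of B(o | theta) * B(theta | past)."""
    return sum(posterior[th] * predictors[th](o) for th in posterior)

# Hypothetical class of two Bernoulli environments (observation 1 w.p. theta).
predictors = {th: (lambda o, th=th: th if o == 1 else 1 - th) for th in (0.2, 0.8)}
posterior = {0.2: 0.5, 0.8: 0.5}
for o in [1, 1, 0, 1]:
    posterior = update_posterior(posterior, {th: predictors[th](o) for th in posterior})

print(posterior)                                   # weight 16/17 on theta = 0.8
print(mixture_prediction(posterior, predictors, 1))  # 13/17
```

After three ones and one zero, the likelihoods are 0.1024 and 0.0064, so the mixture concentrates on the environment with parameter 0.8.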

Probabilistic Problem Formulation
An alternative formulation of the problem statement for unknown environments is a one-step procedure that essentially stays within (Bayesian) probability calculus. In this case actions are treated as random variables and our goal is to determine a distribution $P(a_t | \hat{a}_{<t}, o_{<t})$ that tells us how to act depending on past actions $\hat{a}_{<t}$ and past observations $o_{<t}$. The distribution $P(a_t | \hat{a}_{<t}, o_{<t})$ is the counterpart to the observational distribution $P(o_t | \hat{a}_{\le t}, o_{<t})$ used for prediction. The only caveat is that past probabilistic actions, unlike past observations, have to be marked as causal interventions, denoted by a hat, as in $\hat{a}$, in causal calculus [24].
Given the models (4), both distributions can then be expressed as mixture distributions
$$P(a_t | \hat{a}_{<t}, o_{<t}) = \sum_{\theta} B(a_t | \theta, a_{<t}, o_{<t}) \, B(\theta | \hat{a}_{<t}, o_{<t}), \qquad P(o_t | \hat{a}_{\le t}, o_{<t}) = \sum_{\theta} B(o_t | \theta, a_{\le t}, o_{<t}) \, B(\theta | \hat{a}_{\le t}, o_{<t}),$$
where the posterior is given by
$$B(\theta | \hat{a}_{<t}, o_{<t}) = \frac{B(o_{t-1} | \theta, a_{\le t-1}, o_{<t-1}) \, B(\theta | \hat{a}_{<t-1}, o_{<t-1})}{\sum_{\theta'} B(o_{t-1} | \theta', a_{\le t-1}, o_{<t-1}) \, B(\theta' | \hat{a}_{<t-1}, o_{<t-1})}.$$
Sampling from $P(a_t | \hat{a}_{<t}, o_{<t})$ is equivalent to sampling a random belief $\theta$ from the posterior $B(\theta | \hat{a}_{<t}, o_{<t})$ and then acting according to $B(a_t | \theta, a_{<t}, o_{<t})$. This corresponds to a generalized Thompson sampling procedure, where first a random belief is sampled and then the optimal policy with respect to this belief is executed. Effectively, the posterior is also the only place where causal calculus plays a role, as can be seen by the absence of the likelihood $B(\hat{a}_{t-1} | \theta, a_{<t-1}, o_{<t-1})$, which is equal to one. Intuitively, the reason for this is that the agent can be surprised by its past observations and learn from them, but it cannot be surprised by actions it chose itself in the past. Past actions do not provide any information about the environment. As will be explained in more detail below, causal calculus deals exactly with inference problems in which some random variables are intervened on or set by the inference machine itself. Importantly, this result is obtained solely by applying basic probability and causal calculus.
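A minimal sketch of one round of this procedure is given below: sample a belief, act according to the policy attached to it, and update the posterior with the observation only. The two-armed environment class, the payoff probabilities 0.9 and 0.1, and all function names are assumptions chosen for illustration.

```python
import random

def thompson_step(posterior, policies, rng):
    """Sample theta from B(theta | past), then an action from B(a | theta)."""
    thetas = list(posterior)
    theta = rng.choices(thetas, weights=[posterior[th] for th in thetas])[0]
    action_probs = policies[theta]            # dict: action -> probability
    actions = list(action_probs)
    action = rng.choices(actions, weights=[action_probs[a] for a in actions])[0]
    return theta, action

def observe(posterior, predictors, action, obs):
    """Update beliefs with the observation only; the agent's own action
    enters as a condition, never as evidence (its likelihood is one)."""
    unnorm = {th: posterior[th] * predictors[th](obs, action) for th in posterior}
    z = sum(unnorm.values())
    return {th: w / z for th, w in unnorm.items()}

def make_predictor(good_arm):
    """Hypothetical environment: the good arm pays with probability 0.9."""
    def predict(obs, action):
        p_reward = 0.9 if action == good_arm else 0.1
        return p_reward if obs == 1 else 1.0 - p_reward
    return predict

arms = ("left", "right")
predictors = {th: make_predictor(th) for th in arms}
policies = {th: {a: float(a == th) for a in arms} for th in arms}  # pull the believed-good arm
posterior = {"left": 0.5, "right": 0.5}
rng, true_arm = random.Random(0), "right"

for _ in range(200):
    _, action = thompson_step(posterior, policies, rng)
    obs = 1 if rng.random() < (0.9 if action == true_arm else 0.1) else 0
    posterior = observe(posterior, predictors, action, obs)

print(posterior)
```

In this toy setting every pull is informative about both hypotheses, so the posterior mass on the true arm grows quickly.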

Policy Uncertainty
While subjective expected utility is formally appealing as a principle for the construction of adaptive agents, its strict application is in practice often problematic. This is mainly due to two reasons:
1. Computational complexity. The computations required to find the optimal solution (for instance, solving the Bellman optimality equations) are prohibitive in general and scale exponentially with the length of the horizon: the time complexity of the search algorithm is $O(|A \times O|^T)$. The problem is tractable only in very special cases under assumptions that reduce the effective size of the problem.
2. Causal precedence of policy choice. The choice of the policy has to be made before the interaction with the environment starts. That is, an agent has to have a unique optimal policy before it has even interacted once with the environment. An optimal policy constructed by the maximum SEU principle is therefore a very risky bet, as a lot of resources have to be spent before any evidence exists that the underlying model or prior is adequate.
Because of these two reasons, it is practically often impossible or questionable to apply the maximum SEU principle. In the following, we investigate how to weaken the formal assumptions of the policy construction method.

Policy search
Given a problem specification in terms of the predictive model and the utility function, the task of a policy search method is to calculate a policy that approximates the optimal policy.
More specifically, let $\pi$ be a parameter in a set $\Pi$ indexing the set of candidate policies
$$B(a_t | \pi, a_{<t}, o_{<t}), \tag{6}$$
analogous to the prediction models (4) indexed by $\theta \in \Theta$. Then, in the most general case, a policy search method returns a probability distribution $B(\pi)$ over $\Pi$ representing the uncertainty over the optimal policy parameters. If the algorithm solves the maximum SEU problem, then the support of this distribution will exclusively cover the set of optimal policies $\Pi^* \subset \Pi$. Otherwise there remains uncertainty over the optimal policy parameters. Policies can also be parameterized in terms of the predictive model. In particular, we will assume that for each $\theta \in \Theta$ there is a known optimal policy $\pi \in \Pi$, such that one can construct a function $b : \Theta \to \Pi$ that maps each $\theta$ into some $\pi \in \Pi$. Uncertainty over the environment can then directly be translated into policy uncertainty, such that any point in the convex hull $H(\Theta)$ can be mapped to a corresponding point in the convex hull $H(\Pi)$ spanned by the policies $\pi \in \Pi$.
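The translation of environment uncertainty into policy uncertainty can be sketched as a push-forward of the belief over $\Theta$ through the map b. The environment parameters, the weights and the particular map below are hypothetical.

```python
from collections import defaultdict

def policy_distribution(env_posterior, b):
    """Push the belief over environments through b: Theta -> Pi."""
    out = defaultdict(float)
    for theta, weight in env_posterior.items():
        out[b(theta)] += weight
    return dict(out)

# Hypothetical example: theta is the payoff probability of arm 1 (arm 0 pays 0.5),
# and b maps each environment to the arm that is optimal for it.
def b(theta):
    return "arm1" if theta > 0.5 else "arm0"

env_posterior = {0.1: 0.2, 0.4: 0.3, 0.7: 0.5}
print(policy_distribution(env_posterior, b))
```

Several environments can share one optimal policy, so the induced distribution over policies may be coarser than the belief over environments.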

The Exploration-Exploitation Trade-Off
Many policy search methods do not explicitly deal with the uncertainty over the policy parameters. Some methods only return a point estimate $\hat{\pi} \in \Pi$. It is obvious that the greedy usage of the estimate $\hat{\pi}$ leads to sub-optimal performance, since for all $\hat{\pi}$ that are not in the set $\Pi^*$ of optimal policies, one has
$$E[U | \hat{\pi}] < E[U | \pi^*], \quad \pi^* \in \Pi^*,$$
where $E[U | \pi]$ is the SEU with respect to the policy parameter $\pi \in \Pi$. For instance, reinforcement learning algorithms [4] start from a randomly initialized point estimate $\hat{\pi}_0$ of the optimal policy and then generate refined point estimates $\hat{\pi}_1, \hat{\pi}_2, \hat{\pi}_3, \ldots$ in each time step $t = 1, 2, 3, \ldots$ using the data provided by experience. In order to converge to the optimal policy, these algorithms have to deal with the exploration-exploitation trade-off. This means that the agents cannot just greedily act according to these point estimates; instead, they have to produce explorative actions as well, that is, actions that deviate from the current estimate of the optimal policy, for instance optimistic actions based on UCB [25,26]. Let $B_t(\pi)$ denote the posterior distribution over the optimal policy at time $t$. Then,
$$B_t(\pi) = \sum_{\theta : b(\theta) = \pi} B(\theta | \hat{a}_{<t}, o_{<t}).$$
Hence, this shows that finding the optimal policy amounts to finding the pre-image of $\pi^*$, such that the distribution over the policy space becomes the delta function
$$B(\pi) = \delta_{\pi, b(\theta^*)},$$
where $\theta^* \in \Theta$ is the true prediction model defined in (5). This highlights the essence of the exploration-exploitation trade-off: any action issued by the agent has to respect the uncertainty over the policy parameter, otherwise it is biased. In particular, if the agent acts greedily (i.e. it treats the estimate $\hat{\pi}$ as if it were the true policy parameter) then it is overfitting the experience; likewise, an agent having excessive uncertainty is underfitting. From a frequentist point of view, this reveals that the exploration-exploitation trade-off is nothing else but the bias-variance trade-off [27] in policy space.
This suggests that just like Bayesian modeling naturally balances the bias-variance trade-off by creating estimators that are probability distributions instead of point estimates, Bayesian modeling of the exploration-exploitation trade-off leads to a Bayes-causal solution for generalized Thompson sampling.

Bayes-Causal Solution
It is important to note that the concept of the bias is conditional on the true parameter, which is unknown when the designer is uncertain about the environment. This is not a problem from a Bayesian point of view, because the best estimator of the policy parameter is its posterior distribution [28]. Hence, instead of dealing with the exploration-exploitation dilemma by introducing explorative actions, one can directly use the posterior distribution over the policy parameter as an estimate. To see how to do this, note that, by virtue of the mapping $\pi = b(\theta)$, the policies are independent of the policy parameter when the environment parameter is known:
$$B(a_t | \theta, \pi, a_{<t}, o_{<t}) = B(a_t | \theta, a_{<t}, o_{<t}).$$
Hence, each $\theta \in \Theta$ indexes a dynamical model given by the distributions over interaction sequences
$$B(a_{\le T}, o_{\le T} | \theta) = \prod_{t=1}^{T} B(a_t | \theta, a_{<t}, o_{<t}) \, B(o_t | \theta, a_{\le t}, o_{<t}).$$
Given the dynamical models and their prior probabilities, the designer can form the Bayesian mixture model
$$B(a_{\le T}, o_{\le T}) = \sum_{\theta \in \Theta} B(\theta) \prod_{t=1}^{T} B(a_t | \theta, a_{<t}, o_{<t}) \, B(o_t | \theta, a_{\le t}, o_{<t}), \tag{7}$$
where the sum spans all the parameters in $\Theta$. The mixture models in $H(\Theta) \setminus \Theta$ need not be considered here, since it is assumed that $\theta^* \in \Theta$.

Actions as Causal Interventions
The designer can directly use the probabilistic model (7) to characterize an agent with policy uncertainty. There is a caveat though when actions are treated as random variables. It is clear that the observations produced by the environment update the agent's state of knowledge about the environment. However, the actions are set by the agent itself and hence they do not provide information about the environment. The theory that deals with the distinction between exogenous and endogenous information is statistical causality [29,30]. Observations change the information state by regular Bayesian conditioning, whereas actions constitute causal interventions followed by Bayesian conditioning. To calculate the effect of an intervention, the causal model, i.e. the unique factorization of the joint distribution into conditional probabilities matching the causal dependencies over the random variables, is required to be known. In our setup, this is straightforward: first, the environment secretly chooses a true parameter θ * ∈ Θ, and then the interactions a 1 , o 1 , a 2 , o 2 , . . . follow chronologically.
Formally, this means that the posterior probabilities over the environment parameters are given by $B(\theta | \hat{a}_{<t}, o_{<t})$ rather than the more familiar expression $B(\theta | a_{<t}, o_{<t})$, where the "hat" notation $\hat{a}_t$ denotes a causal intervention [31]. For our needs, it is enough to consider the following simple method to calculate the effect of causal interventions:
1. Expand the probabilities in terms of the joint distribution.
2. Rewrite the joint distribution as the causal factorization.
3. Remove the intervention tags from the intervened random variables that are in the probability conditions.
4. Replace each conditional probability having an intervened variable in the argument by a delta function over its chosen value.
Applying these four steps to the posterior probabilities over the environment parameters yields
$$B(\theta | \hat{a}_{<t}, o_{<t}) = \frac{B(\theta) \prod_{\tau=1}^{t-1} B(o_\tau | \theta, a_{\le \tau}, o_{<\tau})}{\sum_{\theta'} B(\theta') \prod_{\tau=1}^{t-1} B(o_\tau | \theta', a_{\le \tau}, o_{<\tau})}. \tag{8}$$
This equation shows that beliefs are updated only using the observations, and that actions are treated "as if they were known beforehand", thus providing no evidence. Likewise, note that
$$B(a_t | \theta, \hat{a}_{<t}, o_{<t}) = B(a_t | \theta, a_{<t}, o_{<t}). \tag{9}$$
Using (8) and (9), we get the probability of issuing action $a_t \in A$:
$$B(a_t | \hat{a}_{<t}, o_{<t}) = \sum_{\theta} B(a_t | \theta, a_{<t}, o_{<t}) \, B(\theta | \hat{a}_{<t}, o_{<t}). \tag{10}$$
The important fact about (10) is that it was derived only from probability theory and causal calculus by assuming policy uncertainty over a set of policies. We can therefore define the policy and prediction models of the agent as
$$P(a_t | a_{<t}, o_{<t}) := B(a_t | \hat{a}_{<t}, o_{<t}), \qquad P(o_t | a_{\le t}, o_{<t}) := B(o_t | \hat{a}_{\le t}, o_{<t}). \tag{11}$$
The construction of an adaptive agent with policy uncertainty then proceeds analogously to a Bayesian inference process.
1. First, we define a set of prediction models $B(o_t | \theta, a_{\le t}, o_{<t})$ and policy models $B(a_t | \theta, a_{<t}, o_{<t})$, where each policy model is optimal for a particular environment $\theta$. In the case of inference the latent random variable $\theta \in \Theta$ corresponds to the hypothesis.
2. Second, we choose some prior probabilities $B(\theta)$ to model our prior uncertainty.
3. Third, we use the distribution $B(a_t | \hat{a}_{<t}, o_{<t})$ as the agent's adaptive policy $P(a_t | a_{<t}, o_{<t})$, and the distribution $B(o_t | \hat{a}_{\le t}, o_{<t})$ as the agent's adaptive predictor $P(o_t | a_{\le t}, o_{<t})$.
Thus, Thompson sampling is used in every time step to sample an action $a_t$ from the predictive distribution $B(a_t | \hat{a}_{<t}, o_{<t})$.
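The difference between treating past actions as interventions and naively conditioning on them can be made concrete with a small numerical sketch. The two hypotheses, their predictors and their (soft) policies are hypothetical values chosen so that the effect is easy to see.

```python
def posterior(prior, history, predictors, policies=None):
    """B(theta | history). With policies=None, actions are interventions
    (likelihood one); otherwise they are wrongly treated as evidence."""
    weights = dict(prior)
    for a, o in history:
        for th in weights:
            weights[th] *= predictors[th](o, a)
            if policies is not None:
                weights[th] *= policies[th](a)
    z = sum(weights.values())
    return {th: w / z for th, w in weights.items()}

def make_predictor(good_arm):
    # Hypothetical environment: reward (o = 1) w.p. 0.9 on the good arm, 0.1 otherwise.
    return lambda o, a: (0.9 if a == good_arm else 0.1) if o == 1 else (0.1 if a == good_arm else 0.9)

def make_policy(good_arm):
    # Hypothetical soft policy: pull the arm believed to be good w.p. 0.9.
    return lambda a: 0.9 if a == good_arm else 0.1

prior = {"L": 0.5, "R": 0.5}
predictors = {th: make_predictor(th) for th in prior}
policies = {th: make_policy(th) for th in prior}
history = [("L", 0)]  # the agent pulled L and received no reward

print(posterior(prior, history, predictors))             # interventional update
print(posterior(prior, history, predictors, policies))   # naive conditioning
```

With the interventional update the missing reward shifts the belief to 0.9 in favor of R; in the naive version the agent's own choice of L counts as evidence for L and cancels what was learned from the observation.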

Convergence & Co-Adaptation
In [17], the limit behavior of a Thompson sampling agent (11) was investigated.
Assuming that there exists a belief $P(o_t | \theta, a_{\le t}, o_{<t})$ that perfectly models the environment $Q_\theta$ such that $P(o_t | \theta, a_{\le t}, o_{<t}) = Q_\theta(o_t | a_{\le t}, o_{<t})$, then the agent (11) converges in the sense that $P(a_t | a_{<t}, o_{<t}) \to P(a_t | \theta, a_{<t}, o_{<t})$ almost surely as $t \to \infty$ if the interaction system $(P, Q)$ fulfills certain ergodicity requirements and all policies $P(a_t | \theta, a_{<t}, o_{<t})$ are consistent. Roughly speaking, the first requirement ensures that the agent can recover from any initial mistakes, and the second requirement ensures that all predictors $B(o_t | \theta, a_{\le t}, o_{<t})$ that make the same predictions for the tail of the observation sequence are coupled to the same policy $B(a_t | \theta, a_{<t}, o_{<t})$. Thus, the same beliefs imply the same behaviors. But what happens if the environment is also adaptive? As long as the agent has a model $B(o_t | \theta, a_{\le t}, o_{<t})$ that captures the adaptive behavior of the environment, nothing fundamentally changes. However, the agent might not have a model of the adaptive behavior of the environment, while still having a pretty good idea about the environment's preferences. This is typically the case in game theory [32,33], where the agent knows the other agent's best response function, but has no model of the other agent's adaptive behavior. In the simplest one-shot simultaneous move games the best response functions are given by
$$BR_1[Q(o)] = \arg\max_{P(a)} \sum_{a, o} P(a) \, Q(o) \, U(a, o), \qquad BR_2[P(a)] = \arg\max_{Q(o)} \sum_{a, o} P(a) \, Q(o) \, V(a, o),$$
where player 1's best response $BR_1$ is a distribution $P(a)$ over actions $a$ that depends on agent 2's probability $Q(o)$ of emitting $o$, and where $U$ and $V$ are the payoff functions for player 1 and 2 respectively. A Nash equilibrium $(P^*(a), Q^*(o))$ is a fixed point of these coupled equations [34], where each individual player has no incentive to change his distribution, that is,
$$P^*(a) = BR_1[Q^*(o)], \qquad Q^*(o) = BR_2[P^*(a)].$$
How such equilibria are reached is not addressed in classical equilibrium game theory.
However, if such games are repeated over and over again, evolutionary game theory suggests that these equilibria appear as fixed points of adaptation dynamics like the replicator equations [35]. As Bayesian inference can also be viewed as a kind of replicator dynamics [36,37], this provides an interesting starting point to study the emergence of Nash equilibria when two adaptive Thompson sampling agents interact.
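The fixed-point property of a Nash equilibrium can be checked numerically for small games. The sketch below tests whether either player can gain by deviating to a pure action; the 2×2 zero-sum game with payoffs +1 and -1 is a hypothetical example, and `is_nash` is an illustrative helper rather than a method from the cited literature.

```python
def expected_payoff(payoff, p, q):
    """Sum over a, o of p[a] * q[o] * payoff[a][o] for mixed strategies p and q."""
    return sum(p[a] * q[o] * payoff[a][o] for a in range(len(p)) for o in range(len(q)))

def is_nash(U, V, p, q, tol=1e-9):
    """True if neither player can gain by deviating to any pure action."""
    n, m = len(p), len(q)
    best1 = max(expected_payoff(U, [float(i == a) for i in range(n)], q) for a in range(n))
    best2 = max(expected_payoff(V, p, [float(j == o) for j in range(m)]) for o in range(m))
    return (best1 <= expected_payoff(U, p, q) + tol and
            best2 <= expected_payoff(V, p, q) + tol)

# Hypothetical zero-sum game: player 1 wins when the actions match,
# player 2 wins when they differ.
U = [[1, -1], [-1, 1]]
V = [[-1, 1], [1, -1]]
print(is_nash(U, V, [0.5, 0.5], [0.5, 0.5]))   # True
print(is_nash(U, V, [1.0, 0.0], [0.5, 0.5]))   # False
```

It suffices to test pure deviations, because the expected payoff is linear in a player's own mixed strategy.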
When both agents are adaptive according to (11), we can decompose the policies $P(a)$ and $Q(o)$ into mixture distributions, such that
$$P(a) = \sum_{\theta} P(a | \theta) \, P(\theta), \qquad Q(o) = \sum_{\xi} Q(o | \xi) \, Q(\xi),$$
where $P(a | \theta) = BR_1[Q(o)]$ and $Q(o | \xi) = BR_2[P(a)]$, and $P(\theta)$ and $Q(\xi)$ are the prior distributions over the possible behaviors. Moreover, both agents will have predictive models over the other agent's behavior, such that
$$P(o) = \sum_{\theta} P(o | \theta) \, P(\theta), \qquad Q(a) = \sum_{\xi} Q(a | \xi) \, Q(\xi),$$
where $P(o | \theta) = Q(o | \xi)$ and $Q(a | \xi) = P(a | \theta)$. In the following we will assume that there exists at least one pair $(\theta^*, \xi^*)$ where the model $P(o | \theta^*)$ perfectly matches $Q(o | \xi^*)$ such that $P(o | \theta^*) = Q(o | \xi^*)$, and at the same time the model $Q(a | \xi^*)$ perfectly matches $P(a | \theta^*)$ such that $P(a | \theta^*) = Q(a | \xi^*)$. In this case both agents can predict the other agent's behavior, which means that there will be no drive to change the posteriors $P(\theta | D)$ and $Q(\xi | D)$, given some past experience $D$. Then, both agents should "lock in" when their posteriors are sufficiently close to $\delta_{\theta, \theta^*}$ and $\delta_{\xi, \xi^*}$. Formally, we define a pair $(\xi^*, \theta^*)$ to be a strict Nash equilibrium if […]. This corresponds to a lock-in of the predictive models that drive the adaptation process.
In order to study convergence to $(\theta^*, \xi^*)$, we determine the difference between the relative entropy between the predictive distribution $P(o)$ at time $t$ and the generative distribution $Q(o | \xi^*)$, and the relative entropy between the predictive distribution $P(o | D)$ at time $t + 1$ and the generative distribution $Q(o | \xi^*)$, after observing $D$ at time step $t$. This difference can serve as a Lyapunov function to show convergence, if we require that […]. Given the predictive distribution $P(o) = \sum_{\theta} P(o | \theta) P(\theta)$ at time $t$ and the predictive distribution $P(o | D) = \sum_{\theta} P(o | \theta) P(\theta | D)$ at time $t + 1$, we get […]. If the prior weight $P(\theta^*)$ is close to one, then the last positive term $-\log P(\theta^*)$ is close to zero and the two other terms will dominate, making the whole expression negative and thus implying convergence. The argument can also be extended to the case where $D$ is generated by $\sum_{\xi} Q(D | \xi) Q(\xi)$ instead of $Q(D | \xi^*)$, depending again on the weight of $Q(\xi^*)$ compared to the other weights $Q(\xi)$, i.e. how close the other agent is to the Nash policy. The same argument is then repeated for player 2, who uses the prediction model $Q(a | \xi)$ to model $P(a | \theta)$. Consider as an example the matching pennies game [33], where each player has a penny and must decide whether to secretly turn their penny to heads or tails.
[…] where $B(\cdot, \cdot)$ is the beta function, $a_1$ and $a_2$ are the numbers of heads and tails played by player 2, and $b_1$ and $b_2$ are the numbers of heads and tails played by player 1. In this case the only $(\theta^*, \xi^*)$ pair that is a Nash equilibrium is the 50 : 50 case, because only then do action and prediction model fit for both agents. In each time step the agents sample $\theta$ and $\xi$ respectively from their posteriors $P(\theta | \cdot)$ and $Q(\xi | \cdot)$ and act with their best response to this sample. In Figure 4 it can be seen how they co-adapt and converge to the Nash equilibrium.
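The co-adaptation loop described above can be sketched as a simulation. This is one possible reading, under stated assumptions: each player keeps a Beta posterior over the opponent's probability of playing heads (with a uniform Beta(1, 1) prior), samples from it, and plays the pure best response to the sample, with player 1 trying to match and player 2 trying to mismatch. It is not claimed to reproduce the simulation behind Figure 4.

```python
import random

HEADS, TAILS = 0, 1

def simulate(rounds=2000, seed=0):
    rng = random.Random(seed)
    # a: counts of heads/tails played by player 2, as seen by player 1;
    # b: counts of heads/tails played by player 1, as seen by player 2.
    a, b = [1, 1], [1, 1]                     # Beta(1, 1) priors (an assumption)
    for _ in range(rounds):
        q = rng.betavariate(a[HEADS], a[TAILS])   # player 1's sample: P(player 2 plays heads)
        p = rng.betavariate(b[HEADS], b[TAILS])   # player 2's sample: P(player 1 plays heads)
        move1 = HEADS if q > 0.5 else TAILS       # matcher copies the more likely move
        move2 = TAILS if p > 0.5 else HEADS       # mismatcher avoids the more likely move
        b[move1] += 1
        a[move2] += 1
    return a, b

a, b = simulate()
print("player 2 heads frequency:", a[HEADS] / sum(a))
print("player 1 heads frequency:", b[HEADS] / sum(b))
```

Running the sketch with different seeds and horizons is a simple way to inspect how the empirical frequencies behave relative to the 50 : 50 equilibrium.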

Causal induction
Agent (11) can be thought of as a probabilistic superposition of models $\theta$, where each model $\theta$ is characterized by a likelihood model $P(o_t | \theta, ao_{<t} a_t)$ and a policy model $P(a_t | \theta, ao_{<t})$. In previous applications we assumed that all models $\theta$ have the same causal structure, i.e. considering multivariate random variables $a_t$ and $o_t$, we assumed that the same variables $a_t$ are intervened on for all $\theta$ and the same causal model is used to predict the consequences of these interventions on the observational variables $o_t$. However, this need not be the case.
In principle, different models $\theta$ could represent different causal structures and suggest interventions on different variables. Such a setup can be used for causal induction. Imagine, for example, we are given a device with two light bulbs, one green ($X$) and one red ($Y$), whose states obey a hidden mechanism that correlates them positively. Moreover, the device has switches that allow us to control the state of either bulb. We encode the "on" and "off" states of the green light as $X = x$ and $X = \neg x$ respectively. Analogously, $Y = y$ and $Y = \neg y$ denote the "on" and "off" states of the red light. We are interested in the explanatory power of two competing hypotheses: either "green causes red" ($\Theta = \theta$) or "red causes green" ($\Theta = \neg\theta$).
One of the main methods to deal with problems of causal inference is the framework of causal graphical models [29]. Given a graph that represents a causal structure, we can intervene on this graph and ask questions about the probabilities of the variables in the graph. However, in causal induction we would like to discover the causal structure itself, that is, we would like to do inference over a multitude of graphs representing different causal structures [38]. If one wants to represent the problem of causal discovery graphically, the main challenge is that the model $\Theta$ is a random variable that controls the causal structure itself. That is, a tentative graphical representation would need a node $\Theta$ that selects the direction of the edge between $X$ and $Y$, which cannot be analyzed using the mathematical framework of graphical models alone because the random variable $\Theta$ operates on a meta-level of the graphical model over $X$ and $Y$. In fact, different causal structures have to be investigated by different graphical models, that is, the inference process over different causal structures cannot be represented in one and the same graphical model. However, this difficulty can be overcome by using a probability tree to model the causal structure over the random events [39]. Probability trees can encode alternative causal realizations, and in particular alternative causal hypotheses [40]. All random variables are then of the same type, and no distinctions between meta-levels are needed.

Representation
We can use probability trees to represent the prediction model that the agent has about its environment. An exemplary probability tree for our problem is depicted in Figure 2. In this tree, each (internal) node is interpreted as a causal mechanism; hence a path from the root node to one of the leaves corresponds to a particular sequential realization of causal mechanisms. The logic underlying the structure of this tree is as follows:
1. Causal precedence: A node causally precedes its descendants. For instance, the root node corresponding to the sure event Ω causally precedes all other nodes.
2. Resolution of variables: Each node resolves the value of a random variable. For instance, given the node corresponding to Θ = θ and X = ¬x, either Y = y will happen with probability Pr(y|θ, ¬x) = 1/4 or Y = ¬y with probability Pr(¬y|θ, ¬x) = 3/4.
3. Heterogeneous order: The resolution order of random variables can vary across different branches. For instance, X precedes Y under Θ = θ, but Y precedes X under Θ = ¬θ. This allows modeling different causal hypotheses.
While the probability tree represents our subjective model explaining the order in which the random values are resolved, it does not necessarily correspond to the temporal order in which the events are revealed to us. So for instance, under hypothesis Θ = θ, the value of the variable Y might be revealed before X, even though X causally precedes Y ; and the hypothesis Θ, which precedes both X and Y , is never observed.

Interventions
The importance of interventions to detect causal structure is illustrated in Figure 2, as the observational probabilities are completely symmetric for the two halves of the tree. Suppose we observe that both lights are on. Have we learned anything about their causal dependency? A brief calculation shows that this is not the case because the posterior probabilities are equal to the prior probabilities:
$$\Pr(\theta | x, y) = \frac{\Pr(y | \theta, x) \Pr(x | \theta) \Pr(\theta)}{\Pr(y | \theta, x) \Pr(x | \theta) \Pr(\theta) + \Pr(x | \neg\theta, y) \Pr(y | \neg\theta) \Pr(\neg\theta)} = \Pr(\theta).$$
This makes sense intuitively, because by just observing that the two lights are on, it is statistically impossible to tell which one caused the other. The only way to extract causal information is then to intervene, paraphrased as "no causes in, no causes out" [41] or "to find out what happens when you kick the system, you have to kick the system" [42]. Thus, we now repeat our experiment, but this time we turn on the green light ($X = x$). We reflect this choice by changing all the mechanisms that resolve the random variable $X$, placing all the probability mass on the outcome $X = x$ (see Figure 2b). Assume that we subsequently observe that the red light is on ($Y = y$). Then, the posterior probabilities are
$$\Pr(\theta | \hat{x}, y) = \frac{\Pr(y | \theta, \hat{x}) \Pr(\hat{x} | \theta) \Pr(\theta)}{\Pr(y | \theta, \hat{x}) \Pr(\hat{x} | \theta) \Pr(\theta) + \Pr(\hat{x} | \neg\theta, y) \Pr(y | \neg\theta) \Pr(\neg\theta)},$$
where $\hat{x}$ is Pearl's notation to indicate a causal intervention on $X$. Since $\Pr(\theta) < \Pr(\theta | \hat{x}, y)$, we have gathered evidence favoring the hypothesis "green causes red". This was only possible because our intervention introduced a statistical asymmetry between the two hypotheses that did not exist before.
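The two posterior calculations can be worked through with concrete numbers. Only Pr(y|θ, ¬x) = 1/4 is stated in the text; the remaining tree parameters below (a uniform prior, Pr(x|θ) = Pr(y|¬θ) = 1/2 and a 3/4 chance that the effect follows its cause) are assumptions chosen to make the tree symmetric, as the observational argument requires.

```python
from fractions import Fraction as F

# Assumed tree parameters (symmetric between the two hypotheses).
P_THETA = F(1, 2)                  # Pr(theta)
P_X_GIVEN_THETA = F(1, 2)          # Pr(x | theta)
P_Y_GIVEN_THETA_X = F(3, 4)        # Pr(y | theta, x)
P_Y_GIVEN_NOT_THETA = F(1, 2)      # Pr(y | not-theta)
P_X_GIVEN_NOT_THETA_Y = F(3, 4)    # Pr(x | not-theta, y)

def posterior_theta(intervene_on_x):
    """Pr(theta | x, y), with x either observed or set by intervention.
    An intervention replaces every mechanism that resolves X by probability one."""
    px_theta = F(1) if intervene_on_x else P_X_GIVEN_THETA
    px_not_theta_y = F(1) if intervene_on_x else P_X_GIVEN_NOT_THETA_Y
    num = P_Y_GIVEN_THETA_X * px_theta * P_THETA
    den = num + px_not_theta_y * P_Y_GIVEN_NOT_THETA * (1 - P_THETA)
    return num / den

print(posterior_theta(intervene_on_x=False))   # 1/2, equal to the prior
print(posterior_theta(intervene_on_x=True))    # 3/5, evidence for "green causes red"
```

Under these assumed values, pure observation leaves the belief at 1/2, whereas switching the green light on and then seeing the red light on raises it to 3/5.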

Thompson Sampling
Naturally, multiple interventions and observations can be carried out in succession. In this case, Thompson sampling is used at each time step to decide which policy model to use, which in turn determines which variables are intervened on. After the intervention, all variables are revealed simultaneously at every time step of the inference process. The observational probabilities are updated in the same way as in the one-step case, taking into account which variables were intervened on. A simulation of the repeated Thompson sampling process for causal induction of our example system is shown in Figure 3.
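The repeated procedure can be sketched as a short simulation. This is a minimal illustration under stated assumptions, not the simulation behind Figure 3: the mechanism probabilities are invented, and the policy models are assumed to intervene on whichever light the sampled hypothesis regards as the cause. All function names are hypothetical.

```python
import random

# Assumed mechanisms: under each hypothesis the "cause" light is on with
# probability 1/2 and the "effect" light copies it with probability 3/4.
P_CAUSE_ON = 0.5
P_EFFECT_MATCH = 0.75


def likelihood(hyp, x, y, target):
    """Pr(x, y | hyp) with the mechanism of the intervened light removed.

    hyp is "theta" (X causes Y) or "not_theta" (Y causes X);
    target is "X" or "Y", the light that was set to on by intervention.
    """
    cause, effect, cause_name = (x, y, "X") if hyp == "theta" else (y, x, "Y")
    if target == cause_name:
        # Cause was set by hand; only the effect mechanism carries evidence.
        return P_EFFECT_MATCH if effect == cause else 1 - P_EFFECT_MATCH
    # Effect was set by hand; only the cause mechanism carries evidence.
    return P_CAUSE_ON if cause else 1 - P_CAUSE_ON


def environment(true_hyp, target, rng):
    """Sample (x, y) from the true system after setting `target` to on."""
    cause_name = "X" if true_hyp == "theta" else "Y"
    if target == cause_name:
        cause = True
        effect = cause if rng.random() < P_EFFECT_MATCH else not cause
    else:
        cause = rng.random() < P_CAUSE_ON
        effect = True  # forced by the intervention, independent of the cause
    return (cause, effect) if true_hyp == "theta" else (effect, cause)


def run(true_hyp, steps=500, seed=0):
    """Return the posterior Pr(theta | history) after `steps` interactions."""
    rng = random.Random(seed)
    p_theta = 0.5
    for _ in range(steps):
        # Thompson sampling: instantiate a hypothesis from the posterior.
        sampled = "theta" if rng.random() < p_theta else "not_theta"
        # Assumed policy models: intervene on the light believed to be the cause.
        target = "X" if sampled == "theta" else "Y"
        x, y = environment(true_hyp, target, rng)
        w_theta = p_theta * likelihood("theta", x, y, target)
        w_not = (1 - p_theta) * likelihood("not_theta", x, y, target)
        p_theta = w_theta / (w_theta + w_not)
    return p_theta


print(run("theta"), run("not_theta"))
```

In this sketch both kinds of intervention carry evidence on average, so the posterior drifts toward the true hypothesis regardless of which hypothesis happens to be sampled.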

Discussion
Equation (11) was first derived in [17] as the optimal solution to the adaptive coding problem given actions and observations, and serves as a Bayesian rule for control. In practice, it is implemented by sampling an environment parameter $\theta_t$ at each time step from the posterior distribution $B(\theta \mid \hat{a}_{<t}, o_{<t})$ and then treating it as if it were the true parameter, that is, issuing the action $a_t$ from $B(a_t \mid \theta_t, a_{<t}, o_{<t})$. This action-sampling method, in which beliefs are randomly instantiated, was first proposed as a heuristic in [1] and is now known as Thompson sampling. Equation (11) therefore provides a method for generalized Thompson sampling that is applicable to adaptive sequential decision-making problems.
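The two-stage sampling described here (first a parameter, then an action) can be illustrated with a small sketch showing that it reproduces the Bayesian mixture over policies. The posterior weights, the policies, and the names `mixture` and `thompson_action` are all hypothetical and chosen only for illustration.

```python
import random
from collections import Counter

# Hypothetical two-parameter example; all numbers are illustrative.
posterior = {"theta_1": 0.7, "theta_2": 0.3}
policy = {
    "theta_1": {"left": 0.9, "right": 0.1},
    "theta_2": {"left": 0.2, "right": 0.8},
}


def mixture(posterior, policy):
    """Predictive action distribution: sum over theta of weight * policy."""
    out = {}
    for theta, weight in posterior.items():
        for action, prob in policy[theta].items():
            out[action] = out.get(action, 0.0) + weight * prob
    return out


def thompson_action(posterior, policy, rng):
    """Sample a parameter from the posterior, then act as if it were true."""
    theta = rng.choices(list(posterior), weights=list(posterior.values()))[0]
    dist = policy[theta]
    return rng.choices(list(dist), weights=list(dist.values()))[0]


rng = random.Random(0)
n = 20000
counts = Counter(thompson_action(posterior, policy, rng) for _ in range(n))
print(mixture(posterior, policy))
print({a: c / n for a, c in counts.items()})
```

The empirical action frequencies approach the mixture probabilities, which is why the agent never has to compute the mixture explicitly.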
The main contribution of this paper is to examine three features of such generalized Thompson sampling. First, we provide an argument showing that Thompson sampling is a natural consequence of a Bayesian treatment of policy uncertainty. Policy uncertainty arises whenever an agent is trying to find an optimal policy but is unable to do so (for example, due to computational constraints), even though the agent might have a coarse idea about the optimum, which can be expressed as a probability distribution. The Bayesian treatment of this uncertainty is analogous to Bayesian estimation in the case of pure observation problems. The Bayes-optimal estimator in this case is not a point estimate but a distribution, which sidesteps the bias-variance dilemma. Similarly, in the case of actions, the exploration-exploitation trade-off can be circumvented by Thompson sampling from probabilistic policies expressed as Bayesian mixture distributions.

[Figure: (Left) Posterior distribution P(θ|·) for 10 runs when the true system is given by Θ = θ. (Right) Posterior distribution P(θ|·) for 10 runs when the true system is given by Θ = ¬θ. In both cases the agent is able to identify the causal structure of the environment with high confidence when P(θ|·) is close to one or zero, respectively.]
Second, we investigated the co-adaptation of two adaptive Thompson sampling agents. We demonstrated that such agents converge to Nash equilibria if the parameterized policy set they choose from is given by the parameterized best-response functions. This approach also generalizes previous models of fictitious play [43,44], in which agents best-respond to the observed frequency of the opponent's play rather than to their randomized beliefs about the opponent. Therefore, adaptive Thompson sampling agents might provide a useful modeling tool for evolutionary game theory [35] and learning in games in general [45].
Third, we demonstrated that generalized Thompson sampling can also be applied to the problem of causal induction, by designing policy and prediction models with different causal structures. In this way, generalized Thompson sampling can be used as a general method for causal induction that is Bayesian in nature. It is based on the idea of combining probability trees [40] with interventions [29] for predicting the behavior of a manipulated system with multiple causal hypotheses. Both the interventions and the constraints on the causal hypotheses introduce statistical asymmetries that permit the extraction of causal information. Unlike frameworks that aim to extract causal information from observational data alone [46,47,48], the proposed method is designed for agents that interact with their environment and use these interactions to discover causal relationships.
So far Thompson sampling has been mainly applied to multi-armed bandit problems. Multi-armed bandits can be represented by a parameter θ that summarizes the statistical properties of the reward obtained for each lever. Reward distributions range from Bernoulli to Gaussian (with unknown mean and variance), and they can also depend on the particular context or state [9,11,8,10]. In particular, the work of [11] and the work of [8] prove asymptotic convergence of Thompson sampling. The work of [12] presents empirical results that show Thompson sampling is highly competitive, matching or outperforming popular methods such as UCB [25,26].
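For the Bernoulli case, the bandit setting can be sketched as follows. This is the standard Beta-Bernoulli form of Thompson sampling, written here as a minimal illustration rather than as the procedure of any of the cited works; the arm means, the uniform Beta(1, 1) prior, and the function name are assumptions.

```python
import random


def bernoulli_thompson(true_means, steps=2000, seed=0):
    """Run Thompson sampling on a Bernoulli bandit; return pulls per arm."""
    rng = random.Random(seed)
    wins = [1] * len(true_means)    # Beta(1, 1) prior for every arm
    losses = [1] * len(true_means)
    pulls = [0] * len(true_means)
    for _ in range(steps):
        # Instantiate one belief about each arm's mean from its posterior.
        samples = [rng.betavariate(w, l) for w, l in zip(wins, losses)]
        # Act optimally with respect to the sampled belief.
        arm = max(range(len(samples)), key=samples.__getitem__)
        reward = rng.random() < true_means[arm]
        wins[arm] += reward
        losses[arm] += not reward
        pulls[arm] += 1
    return pulls


print(bernoulli_thompson([0.2, 0.8]))
```

Here θ corresponds to the vector of unknown arm means, and the posterior over θ factorizes into one Beta distribution per arm.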
Another class of problems to which Thompson sampling has been applied is Markov decision processes (MDPs). MDPs can be described by parameterizing the dynamics and reward distribution (model-based) [49] or by directly parameterizing the Q-table (model-free) [50,31]. The first approach samples a full description of an MDP, solves it for the optimal policy, and then issues the optimal action. This is repeated at each time step. The second approach avoids the computational overhead of solving for the optimal policy at each time step by performing inference directly on the Q-tables. Actions are chosen by picking the one with the highest Q-value for the current state. The same ideas can also be applied to solve adaptive control problems with linear system equations, quadratic cost functions, and Gaussian noise [51].

Optimality
One of the main arguments is that the derivation presented in Section 3.3 shows that generalized Thompson sampling is not just a heuristic method, but that it can be derived under the assumption of policy uncertainty simply by applying probability theory and causal calculus. This Thompson sampling approach differs from the formulation of adaptive control problems as control problems with known environments that require the maximization of a subjective expected utility criterion (compare Section 2.3). The difference between the two approaches can be highlighted by contrasting the two one-step decision scenarios depicted in Figure 4. The goal is to predict the outcome of a biased coin with payoffs $1 and $0 for a right and wrong guess, respectively. A rational decision-maker places bets (shown inside speech bubbles) such that his subjective expected utility is maximized. These subjective beliefs are delimited within dotted boxes.
The difference between the two becomes clear by inspecting the expected utility in each case: they are of the form $\max_a \mathbb{E}_\theta[U(a, \theta)]$ and $\mathbb{E}_\theta[\max_a U(a, \theta)]$, respectively, where $U(a, \theta)$ denotes the payoff of guess $a$ when the coin bias is $\theta$. Here it is clearly seen that the difference between the two lies in the order in which we apply the expectation (over the environment parameter) and the maximization operator. Both cases can be explained in terms of optimality. However, in (a), the decision-maker picks his action taking into account the uncertainty over the bias, while in (b), the decision-maker picks his action only after his beliefs over the coin bias are instantiated; that is, he is optimal with respect to his random beliefs.
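The effect of swapping expectation and maximization can be checked on a toy version of the coin example. The two-point belief over the bias and the names `utility`, `max_of_expectation`, and `expectation_of_max` are assumptions made for illustration; they are not taken from Figure 4.

```python
from fractions import Fraction as F

# Assumed belief over the coin bias theta = Pr(heads); numbers are illustrative.
prior = {F(1, 5): F(1, 2), F(9, 10): F(1, 2)}
guesses = ("heads", "tails")


def utility(guess, theta):
    """Expected payoff in dollars of a guess when the bias is theta."""
    return theta if guess == "heads" else 1 - theta


# (a) Maximize after averaging over theta: one fixed bet for all theta.
max_of_expectation = max(
    sum(p * utility(g, th) for th, p in prior.items()) for g in guesses
)

# (b) Average after maximizing: the bet is chosen once theta is instantiated.
expectation_of_max = sum(
    p * max(utility(g, th) for g in guesses) for th, p in prior.items()
)

print(max_of_expectation, expectation_of_max)  # 11/20 17/20
```

Both quantities are subjective values: in (b) each instantiated belief is treated as if it were true, so the number describes what the decision-maker expects under his random beliefs rather than the payoff he would collect against the actual coin.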
The distinction between probabilities that one takes into account when making a decision and probabilities that one does not (i.e., those that are immeasurable) was first proposed by [52]. The classical decision theories of [18] and [19] only consider known probabilities that are reasoned about inside the max-operation. Another example where random beliefs play a crucial role is in games with incomplete information [33]. Here, having incomplete information about the other player leads to an infinite hierarchy of meta-reasoning about the other player's strategy. To avoid this difficulty, Harsanyi introduced Bayesian games [53]. In a Bayesian game, incomplete knowledge is modeled by randomly instantiating the players' types, after which they choose their strategies optimally, thus eliminating the need for recursive reasoning about the other players' strategies.
Maintaining and updating Bayesian probabilities is an optimally efficient way to deal with uncertainty, be it with respect to the policy or the environment [31]. Therefore, the central claim is that having random beliefs, as formalized by generalized Thompson sampling, can be considered optimal under the constraint of having policy uncertainty, which is inevitable whenever we are unable to compute the optimal policy. Having policy uncertainty effectively weakens the two assumptions of the maximum expected utility principle: the optimal policy can be chosen and refined during interactions, and the computational complexity is lower.
The operational distinction of having policy uncertainty has important algorithmic consequences. When there is policy uncertainty, the belief of the decision-maker is itself a random variable. This means that the very policy is undefined until the random variable is resolved. Hence, the computation of the optimal policy can be delayed and determined dynamically. It is precisely this fact that is (implicitly) exploited in popular reinforcement learning algorithms, and explicitly in the algorithms based on random beliefs. This is in stark contrast to the case when there is no policy uncertainty, where the policy is pre-computed and static.

Adaptive Coding and the Kullback-Leibler Divergence
Even though the maximization is inside the expectation in random-belief approaches to decision-making, there is another, outer optimality criterion implicit in them, analogous to Bayesian inference, which is known to optimize Kullback-Leibler divergences. It is therefore useful to think about the adaptive control problem as an adaptive coding and inference problem [17,54]. In terms of the initial problem statement in Section 2, the question then is: how can the designer construct a system $P$ defined by $P(o_t \mid a_{\le t}, o_{<t})$ and $P(a_t \mid a_{<t}, o_{<t})$ such that its behavior is as close as possible to the custom-made system $B(o_t \mid \theta, a_{\le t}, o_{<t})$ and $B(a_t \mid \theta, a_{<t}, o_{<t})$ under any realization of $Q_\theta$? Using the Kullback-Leibler divergence as a distance measure, we can formulate a variational problem in $P$:

$$P := \arg\min\,[\ldots]$$

In the case of observations, this is a well-known variational principle for Bayesian inference, as it describes a predictor that requires, on average, the least amount of extra bits to capture informational surprise stemming from the behavior of the environment. In the case of actions, the same principle can be harnessed to describe the resource-efficient generation of actions in a way that requires random bits with minimum length on average when trying to match the optimal policy most suitable for the unknown environment [55].

Evolutionary Game Theory
When dealing with adaptive agents, one of the most intriguing questions is what happens when two adaptive agents are coupled. Classical game theory does not readily address this question, as it abstracts away from learning and adaptation processes and focuses on fixed-point conditions for equilibria. In contrast, evolutionary game theory focuses on the dynamics that can lead to equilibria [35]. Among the most widely studied dynamics in evolutionary game theory are the so-called replicator equations

$$x^i_{t+1} = x^i_t \, \frac{f_i(x_t)}{\sum_j x^j_t \, f_j(x_t)},$$

where $x^i_t$ represents the proportion of type $i$ in a population of individuals at time $t$. The vector $x_t = (x^1_t, \ldots, x^n_t)$ represents the population distribution, such that $\sum_j x^j_t = 1$. The function $f_i(x)$ denotes the fitness of type $i$, which depends on the population $x$. The proportion of individuals of type $i$ at the next time point $t + 1$ is determined by the fitness share this type achieves compared to the population total.
Interestingly, there is a formal correspondence between the replicator dynamics and Bayesian inference [36,37],

$$p(h \mid d) = p(h) \, \frac{p(d \mid h)}{\sum_{h'} p(h') \, p(d \mid h')},$$

where $p(h)$ and $p(h \mid d)$ represent the prior and posterior probability mass allotted to hypothesis $h$. The likelihood function $p(d \mid h)$ works as a fitness landscape. The posterior probability is determined by the likelihood fitness achieved compared to the overall evidence $p(d) = \sum_{h'} p(h') \, p(d \mid h')$. In evolutionary game theory, the fixed points of the replicator dynamics have been studied extensively. In particular, evolutionarily stable strategies [56] have been shown to be a refinement of the common Nash equilibrium, in the sense that such Nash equilibria are stable with respect to perturbing mutant strategies. Since generalized Thompson sampling as described in (11) shares its form with Bayesian inference, the connection to evolutionary replicator equations is immediate. Therefore, we could apply very similar stability arguments in the case of two interacting adaptive agents (11) as previously applied in the case of the replicator dynamics. Generalized Thompson sampling might therefore also provide a useful tool in the future for studying the convergence of co-adaptation processes within the context of evolutionary game theory.
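The formal correspondence can be made tangible with a few lines of code. This is a sketch under a simplifying assumption: the fitness is held constant and equal to the likelihood, whereas in general it may depend on the population. The function names and the numbers are hypothetical.

```python
def replicator_step(x, fitness):
    """One discrete replicator update; `fitness` maps a population to f(x)."""
    f = fitness(x)
    total = sum(xi * fi for xi, fi in zip(x, f))
    return [xi * fi / total for xi, fi in zip(x, f)]


def bayes_update(prior, likelihood):
    """Posterior over hypotheses given the likelihood of the observed datum."""
    evidence = sum(p * l for p, l in zip(prior, likelihood))
    return [p * l / evidence for p, l in zip(prior, likelihood)]


prior = [0.5, 0.3, 0.2]          # population shares / prior mass
likelihood = [0.1, 0.4, 0.7]     # p(d | h) for one observed datum d

# With a population-independent fitness equal to the likelihood,
# the replicator step and the Bayesian update coincide.
next_population = replicator_step(prior, lambda x: likelihood)
posterior = bayes_update(prior, likelihood)
print(next_population)
print(posterior)
```

The normalizing population total plays the role of the evidence, and types with above-average fitness gain share in the same way that hypotheses with above-average likelihood gain probability mass.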

Causality
To construct the Bayes-causal solution in Section 3.3, we needed to treat actions as interventions. This raises the question of why this distinction was not made when deriving classical SEU solutions.
Inspecting (13), we conclude that $B(a_k \mid \theta, a_{<k}, o_{<k}) = \delta_{\bar{a}_k}(a_k)$, i.e., the actions have to be issued deterministically (although possibly in a history-dependent way) from a unique policy. Intuitively speaking, this is because the operations of intervening and conditioning coincide when the random variables are deterministic.

Open Problems
There are important cases where random-belief approaches can fail. Indeed, it is easy to devise experiments in which an agent with policy uncertainty converges exponentially more slowly than an agent that knows the optimal policy, or does not converge at all. Consider the following simple example. Environment 1 is a $k$-state MDP in which only $k$ consecutive actions A reach a state with reward +1; any interruption by a B action leads back to the initial state. Environment 2 is like the first, but with actions A and B interchanged. The optimal policy identifies the true environment in $k$ actions (either $k$ consecutive A's or B's). Consider now an agent with random beliefs: the optimal action in environment 1 is A, and in environment 2 it is B. A uniform $(\tfrac{1}{2}, \tfrac{1}{2})$ prior over the two possible environments remains a uniform posterior as long as no reward has been observed. Hence, an agent with random beliefs chooses A and B with equal probability at each time step. With this policy it takes about $2^k$ actions to accidentally choose a run of A's (or B's) of length $k$. From then on, the agent acts optimally too. Thus, the optimal policy converges in time $k$, while the agent with policy uncertainty needs exponentially longer. A simple way to remedy this problem is, of course, to sample random beliefs only every $k$ time steps. But this problem can be exacerbated in non-stationary environments. Take, for instance, an increasing MDP with $k = 10\sqrt{t}$, in which the optimal policy converges in 100 steps, while an agent with policy uncertainty would not converge at all in most realizations.
Although [17] prove asymptotic convergence for general environments satisfying a restrictive ergodicity condition, this condition needs to be weakened for the convergence proof to be applicable to most real problems. It is clear, however, that some form of ergodicity is required for an agent with policy uncertainty to be able to learn to act optimally. Intuitively, this means that an agent can only learn if the environment has temporally stable statistical properties. Finally, determining the speed of convergence and the regret is currently an open problem.
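The exponential gap in this example can be checked with a short calculation. The sketch below assumes environment 1 is the true one, so that the random-belief agent must produce $k$ consecutive A's by fair coin flips; the function names are hypothetical. The exact expectation it computes, $2^{k+1} - 2$, is of the same order as the rough figure of $2^k$ quoted in the text.

```python
import random


def expected_steps_random_agent(k):
    """Expected number of uniform random actions until k consecutive A's.

    With E_0 = 0, a run of length j - 1 is extended with probability 1/2
    and reset otherwise, which gives E_j = 2 * E_{j-1} + 2.
    """
    e = 0
    for _ in range(k):
        e = 2 * e + 2
    return e


def simulate_random_agent(k, rng):
    """Count the actions of one episode until the first run of k A's."""
    run = steps = 0
    while run < k:
        steps += 1
        run = run + 1 if rng.random() < 0.5 else 0  # a B action resets the run
    return steps


rng = random.Random(0)
k = 5
trials = 4000
mean_steps = sum(simulate_random_agent(k, rng) for _ in range(trials)) / trials
print(k, expected_steps_random_agent(k), mean_steps)
print(expected_steps_random_agent(10))  # 2046, versus k = 10 for the optimal policy
```

Resampling the belief only every $k$ steps, as suggested in the text, would replace the per-step coin flip by one flip per block of $k$ actions.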

Conclusion
In this paper we have argued that policy uncertainty is a natural phenomenon that arises whenever there are not enough computational resources to apply the maximum SEU principle. We have shown that treating this uncertainty in a Bayesian way, with actions as random variables that obey causal calculus, naturally leads to Thompson sampling and its Bayesian generalization. This generalized Thompson sampling can be straightforwardly applied to evolutionary game theory and to the problem of causal induction. As these random-belief approaches can be derived simply from probability theory and causal calculus, we suggest that they should not be considered mere heuristics but well-founded, principled approaches.

Acknowledgments
This study was supported by the DFG, Emmy Noether grant BR4164/1-1.