Thermodynamics as a theory of decision-making with information processing costs

Pedro A. Ortega and Daniel A. Braun

July 31, 2012

Abstract

Perfectly rational decision-makers maximize expected utility, but crucially ignore the resource costs incurred when determining optimal actions. Here we propose an information-theoretic formalization of bounded rational decision-making where decision-makers trade off expected utility and information processing costs. Such bounded rational decision-makers can be thought of as thermodynamic machines that undergo physical state changes when they compute. Their behavior is governed by a free energy functional that trades off changes in internal energy (as a proxy for utility) and entropic changes representing computational costs induced by changing states. As a result, the bounded rational decision-making problem can be rephrased in terms of well-known concepts from statistical physics. In the limit when computational costs are ignored, the maximum expected utility principle is recovered. We discuss the relation to satisficing decision-making procedures as well as links to existing theoretical frameworks and human decision-making experiments that describe deviations from expected utility theory. Since most of the mathematical machinery can be borrowed from statistical physics, the main contribution is to axiomatically derive and interpret the thermodynamic free energy as a model of bounded rational decision-making.

1 Introduction

In everyday life decision-makers often have to make fast and frugal choices [1, 2]. Consider, for example, an antelope that quickly has to choose a direction of flight when faced with a predator. By the time an antelope had considered all possible flight paths to determine the optimal one, it would most probably be already eaten.
In general, decision-makers seem to trade off the expected desirability of the consequences of an action against the effort and resources (time, money, food, computational effort, knowledge, opportunity costs, etc.) required for searching the optimum [3, 4].

Classic theories of decision making generally ignore information-processing costs by assuming that decision makers always pick the option with maximum return, irrespective of the effort or the resources it might take to find or compute the optimal action [5, 6, 7]. Such decision-makers are described as perfectly rational. However, being perfectly rational seems to contradict our intuition of real-world decision-making, where information processing constraints play an important role [1]. This has led to an abundant literature on bounded rationality [8, 9, 10, 11]. Unlike perfectly rational decision makers, bounded rational decision-makers are subject to information processing constraints, that is, they may have limited time and speed to process a limited amount of information.

1.1 Thermodynamic Intuition

Figure 1: The Molecule-In-A-Box Device. (a) Initially, the molecule moves freely within a space of volume V delimited by two pistons. The compartments A and B correspond to the two logical states of the device. (b) Then, the lower piston pushes the molecule into part A having volume V′ = pV.

Here we follow a thermodynamic argument [12] that allows measuring resource (or information) costs in physical systems in units of energy. The generality of the argument relies on the fact that ultimately any real agent has to be incarnated in a physical system, as any process of information processing must always be accompanied by a pertinent physical process [13]. In the following we conceive of information processing as changes in information states (i.e.
ultimately changes of probability distributions), which consequently implies changes in physical states, such as flipping gates in a transistor, changing voltage on a microchip, or even changing location of a gas particle. Such state changes in physical systems are not for free, that is, they do not happen spontaneously. Consequently, if we want to steer a physical system into a desirable state we also have to take into consideration that changing from the current state to the desirable state incurs a cost.

According to Landauer's principle, one can postulate a formal correspondence between one unit of information and one unit of energy [14, 15, 16]. Consider representing one bit of information using one of the following logical devices: a molecule that can be located either in the top or the bottom part of a box; a coin whose face-up side can be either heads or tails; a door that can be either open or closed; a train that can be oriented facing either north or south; and so forth. Assume that all these devices are initialized in an undetermined logical state, where the first state has probability p and the second probability 1 − p. Now, imagine you want to set these devices to their first logical state. In the case of the molecule in a box, this means the following. Initially, the molecule is uniformly moving around within a space confined by two pistons as depicted in Figure 1a. Assuming that the initial volume is V, the molecule has to be pushed by the lower piston into the upper part of the box having volume V′ = pV (Figure 1b). From information theory, we know that the number of bits that we fix by this operation is given by −log p.

To make things concrete, we assume that the device has diathermal walls and is immersed in a heat bath at constant temperature T. Since the walls are diathermal, the temperature inside of the box is maintained at the temperature of the heat bath. We model the particle as an ideal gas.
When an ideal gas is compressed under isothermal conditions from an initial volume V to a final volume V′, the work is calculated as

\[ W = -\int_{V}^{V'} \frac{NkT}{V}\, dV = NkT \ln \frac{V}{V'}, \tag{1} \]

where N ≥ 0 is the amount of substance and k > 0 is the Boltzmann constant. The minus sign is just a convention to denote work done by the piston rather than by the gas. If we assume N = 1 and make use of the fact that V′ = pV we get

\[ W = kT \ln \frac{V}{pV} = -kT \ln p = -\frac{kT}{\log e} \log p = -\gamma_{\mathrm{mol}} \log p, \]

where the constant $\gamma_{\mathrm{mol}} := kT / \log e > 0$ can be interpreted as the conversion factor between one unit of information and one unit of energy for the molecule-in-a-box device.

How do we compute the information and work for the case of the coin, door and train devices? The important observation is that we can model these cases as if they were molecule-in-a-box devices, with the difference that their conversion factors between units of information and units of work are different. Hence, the number of bits fixed while these devices are set to the first state is given by −log p, i.e. exactly as in the case of the molecule. However, the work is given by $-\gamma_{\mathrm{coin}} \log p$, $-\gamma_{\mathrm{door}} \log p$ and $-\gamma_{\mathrm{train}} \log p$ respectively, where $\gamma_{\mathrm{coin}}$, $\gamma_{\mathrm{door}}$ and $\gamma_{\mathrm{train}}$ are the associated conversion factors between units of information and units of work. Obviously, $\gamma_{\mathrm{mol}} \le \gamma_{\mathrm{coin}} \le \gamma_{\mathrm{door}} \le \gamma_{\mathrm{train}}$.

The point is that changes in knowledge states are costly and that these costs are proportional to the information. In the next section, we derive a general expression of information costs in physical systems that make decisions.

2 Information-Theoretic Foundations

2.1 Resource Costs

We model any observable sequential process, such as a sequence of interactions or a sequence of computation steps, as a filtration on a measure space. To simplify our exposition, we consider only finite measure spaces. Let (Ω, Σ) denote a measurable space, where Ω denotes the sample space and where Σ is a σ-algebra on Ω.
Let p be a conditional probability measure on (Ω, Σ), such that for any two events A, B ∈ Σ, p(A|B) denotes the conditional probability of A given B, where the condition B plays the role of the current information state of the process. The sequential realization of a process is modelled as a sequence of conditions $A_1, A_2, \ldots, A_T$ on the sample space Ω, where each new condition $A_t$ refines the current information state $\bigcap_{\tau \le t} A_\tau$ by excluding the complement $A_t^{\complement}$.

We further assume that a transformation of an information state from B to (A ∩ B) entails a cost ρ(A|B) that could be measured in dollars, time or any arbitrary scale of effort. Moreover, we assume that this transformation cost is decomposable; that is, if we undergo a knowledge change from C to (A ∩ B ∩ C), then we should pay the same cost as undergoing a change first from C to (B ∩ C) and then from (B ∩ C) to (A ∩ B ∩ C). Finally, the quintessential information-theoretic postulate is that conditional probabilities impose a monotonic order over transformation costs¹. We can sum up our postulates as follows:

Definition 1 (Axioms of Transformation Costs). Let (Ω, Σ) be a measurable space and let p : (Σ × Σ) → [0, 1] be a conditional probability measure over Σ (i.e. for any A ∈ Σ, p(·|A) is a probability measure over A). A function ρ : (Σ × Σ) → R+ is a transformation cost function for p iff it has the following three properties for all events A, B, C, D ∈ Σ:

A1. real-valued: $\exists f,\ \rho(A|B) = f\big(p(A|B)\big) \in \mathbb{R}$,
A2. additive: $\rho(A \cap B|C) = \rho(A|C) + \rho(B|A \cap C)$,
A3. monotonic: $[\rho(A|B) > \rho(C|D)] \Leftrightarrow [p(A|B) < p(C|D)]$.

These three properties enforce a strict correspondence between probabilities and transformation costs [18, 19].

Theorem 1 (Transformation Costs ↔ Probabilities). If f is such that ρ(A|B) = f(p(A|B)) for every choice of the probability space (Ω, Σ, p), then f is of the form

\[ f(\cdot) = -\frac{1}{\beta} \log(\cdot), \]

where β is a real parameter.
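As a quick sanity check on Theorem 1, the following minimal sketch evaluates f(p) = −(1/β) log p on a few hypothetical probabilities and confirms numerically that it satisfies the additivity axiom A2 and the monotonicity axiom A3. The function name `transformation_cost`, the probability values, and the choice of β = 1 with natural logarithms are assumptions made purely for illustration.

```python
import math

def transformation_cost(p: float, beta: float = 1.0) -> float:
    """Cost f(p) = -(1/beta) * log(p); natural log and beta > 0 are assumed here."""
    return -math.log(p) / beta

# Hypothetical conditional probabilities (not taken from the paper).
p_A_given_C = 0.5                            # p(A | C)
p_B_given_AC = 0.2                           # p(B | A ∩ C)
p_AB_given_C = p_A_given_C * p_B_given_AC    # product rule gives p(A ∩ B | C)

# Axiom A2: rho(A ∩ B | C) = rho(A | C) + rho(B | A ∩ C).
lhs = transformation_cost(p_AB_given_C)
rhs = transformation_cost(p_A_given_C) + transformation_cost(p_B_given_AC)
additive = math.isclose(lhs, rhs)

# Axiom A3: the less probable event carries the higher cost.
monotonic = transformation_cost(0.1) > transformation_cost(0.9)

print("additive:", additive, "monotonic:", monotonic)
```

The check works because the logarithm turns the product rule for conditional probabilities into a sum of costs, which is the content of axiom A2.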
¹ This intuition is central for optimal coding theory where short codewords are assigned to frequent events and long codewords are assigned to rare events [17]. Therefore, we could regard the codeword length as a valuable resource that we have to bet on events with different probabilities.

Theorem 1 states that the transformation cost ρ(A|B) is proportional to the information content −log p(A|B), where the parameter β determines the conversion factor. The logarithmic mapping between probabilities and "costs" is well known in information theory, and there are many possible ways to derive it [20, 21]. The important observation is that our derivation stems purely from postulates regarding transformation costs.

According to Definition 1, transformation costs measure the cost of an event relative to a reference event. However, we can also assign an absolute cost measure to single events such that transformation costs are obtained as differences.

Definition 2 (Potential). Let ρ be a transformation cost function. A set function φ : Σ → R is called a cost potential for ρ iff for all A, B ∈ Σ,

\[ \phi(\Omega) := \phi_0, \qquad \phi(A \cap B) := \phi(B) + \rho(A|B), \]

where $\phi_0$ is an arbitrary real value.

One can easily verify that this potential is well defined for all events, and that ρ(A|B) = φ(A ∩ B) − φ(B). It captures the intuition that starting out from the high-probability event B with potential φ(B) one has to pay the cost ρ(A|B) to arrive at the low-probability event A ∩ B with potential φ(A ∩ B).

Cost potentials have an important recursive structure: the cost potential of an event is uniquely determined by the potentials of its constituent events. Specifically, if X is a measurable partition of a reference event S ∈ Σ, then

\[ \phi(S) = -\frac{1}{\beta} \log \sum_{x \in X} e^{-\beta \phi(x)}. \tag{2} \]

Furthermore, the probability of a member x ∈ X of the partition relative to S can be expressed as a Gibbs measure:

\[ p(x|S) = \frac{e^{-\beta \phi(x)}}{e^{-\beta \phi(S)}} = \frac{e^{-\beta \phi(x)}}{\sum_{x' \in X} e^{-\beta \phi(x')}}. \tag{3} \]

In statistical physics it is well known that the Gibbs measure satisfies a variational principle in the free energy, which is defined as

\[ F_\beta[q] := \sum_{x \in X} q(x) \phi(x) + \frac{1}{\beta} \sum_{x \in X} q(x) \log q(x). \tag{4} \]

More specifically, for any probability measure q over the partition X of S,

\[ F_\beta[q] \ge F_\beta[p] = -\frac{1}{\beta} \log \sum_{x \in X} e^{-\beta \phi(x)} = \phi(S), \tag{5} \]

where the lower bound is attained by the Gibbs measure $p(x) \propto e^{-\beta \phi(x)}$. Equations (2) to (5) constitute fundamental results that will be generalized and interpreted in the next section.

2.2 Gains and Losses

Equipped with the results from the preceding section, we can now proceed to model a bounded rational decision maker. Because transformation costs matter, we model a decision as a transformation of a prior behavior into a final behavior, where we represent the direction of change as a utility criterion. The Gibbs measure in (3) allows us to describe a probability measure p over a partition X in terms of a cost potential φ over X. In particular, we see that a decision-maker's a priori behavior or belief described by $p_0(x)$ and $\phi_0(x)$ changes to p(x) and φ(x) if he is exposed to the gain (or loss) U(x), such that

\[ \phi(x) = \phi_0(x) - U(x) \tag{6} \]

and

\[ p(x) \propto e^{-\beta \phi_0(x) + \beta U(x)} \propto p_0(x)\, e^{\beta U(x)} \tag{7} \]

as illustrated in Figure 2. The function U represents either gains or losses and not absolute levels of costs, because it expresses a difference in the potential, $U(x) = \phi_0(x) - \phi(x)$. The equilibrium distribution (7) that arises in a change can also be characterized in terms of a variational principle, in a manner analogous to (5).

Theorem 2 (Negative Free Energy Difference). Let $p_0(x)$ and p(x) be the Gibbs measures with potentials $\phi_0(x)$ and φ(x) and resource parameter β. Let $F_0$ and F be the free energies minimized by $p_0$ and p respectively.
Then, the negative free energy difference $-\Delta F = F_0 - F$ is

\[ -\Delta F = \sum_{x \in X} p(x) U(x) - \frac{1}{\beta} \sum_{x \in X} p(x) \log \frac{p(x)}{p_0(x)}, \tag{8} \]

where $U(x) = \phi_0(x) - \phi(x)$.

Since $F_0$ does not depend on p, the negative free energy difference $-\Delta F = F_0 - F$ depends on p only through the free energy F, and we can use −∆F directly as a variational principle in p.

Corollary 3 (Variational Principle). The negative free energy difference provides a variational principle for the equilibrium distribution, i.e.

\[ -\Delta F[q] := \sum_{x \in X} q(x) U(x) - \frac{1}{\beta} \sum_{x \in X} q(x) \log \frac{q(x)}{p_0(x)} \]

is maximized by

\[ p(x) = \frac{1}{Z}\, p_0(x)\, e^{\beta U(x)}, \qquad \text{where } Z := \sum_{x' \in X} p_0(x')\, e^{\beta U(x')}. \]

Furthermore,

\[ -\Delta F[q] \le -\Delta F[p] = \frac{1}{\beta} \log Z. \]

Figure 2: Representing a decision maker as a thermodynamic system, the behavior of the decision-maker exposed to a gain U can be expressed as a change of his initial cost potential $\phi_0$ to a final cost potential φ, where $\phi = \phi_0 - U$. The choice or belief probabilities of the decision-maker change according to (7) from $p_0$ to p.

2.3 Choice & Belief Probabilities

The distribution (7) can be interpreted either as an action probability or as an observation probability in the context of bounded rational decision-making. In the case of actions, $p_0$ represents the a priori choice probability of the agent, which is refined to the choice probability p when evaluating the imposed gain (or loss) U. The associated change in probability depends on the resource parameter β and corresponds to the computation that is necessary to evaluate the gains (or losses). In the case of observations, $p_0$ represents the a priori belief of the agent given by a probabilistic model, which is then distorted due to the presence of possible gains (or losses) that are evaluated by the holder of the belief. This way, model uncertainty and risk-aversion can be parameterized by β. For different values of β the distribution (7) has the following limits:

\[ \lim_{\beta \to \infty} p(x) = \delta(x - x^*), \qquad x^* = \arg\max_x U(x), \]
\[ \lim_{\beta \to 0} p(x) = p_0(x), \]
\[ \lim_{\beta \to -\infty} p(x) = \delta(x - x^*), \qquad x^* = \arg\min_x U(x). \]
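The three limits of the choice rule (7) can be illustrated numerically. The sketch below is a minimal stdlib-only implementation of p(x) ∝ p0(x) exp(βU(x)); the helper name `bounded_rational_choice`, the three-outcome prior, and the gain values are hypothetical and chosen only for illustration. Large but finite values of β stand in for the limits β → ±∞.

```python
import math

def bounded_rational_choice(p0, U, beta):
    """Return p(x) proportional to p0(x) * exp(beta * U(x)), computed in log space."""
    logits = [math.log(p) + beta * u for p, u in zip(p0, U)]
    shift = max(logits)                      # subtract the max to avoid overflow
    weights = [math.exp(l - shift) for l in logits]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical prior and gains over three outcomes.
p0 = [0.5, 0.3, 0.2]
U = [1.0, 3.0, 2.0]

p_rational = bounded_rational_choice(p0, U, beta=50.0)    # concentrates on argmax U
p_prior = bounded_rational_choice(p0, U, beta=0.0)        # reproduces the prior p0
p_anti = bounded_rational_choice(p0, U, beta=-50.0)       # concentrates on argmin U

for name, p in [("beta=+50", p_rational), ("beta=0", p_prior), ("beta=-50", p_anti)]:
    print(name, [round(x, 4) for x in p])
```

Working in log space is a design choice for numerical stability only; it does not change the distribution defined by (7).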
In the case of actions the three limits imply the following: The limit β → ∞ corresponds to the perfectly rational actor that infallibly selects the action that maximizes gain (or minimizes loss −U(x)). The limit β → 0 is an actor without resources that simply selects his action according to his prior. The limit β → −∞ corresponds to an actor that is perfectly "anti-rational" and always selects the action with the worst outcome. In the case of observations the three limits correspond to an extremely optimistic observer (β → ∞) who believes only in the best possible outcome, an extremely pessimistic observer (β → −∞) who anticipates only the worst, and a risk-neutral Bayesian observer (β → 0) who simply relies on the probabilistic model $p_0$.

2.4 The Certainty Equivalent

In statistical physics [22], the free energy difference ∆A = ∆E − Q = W measures the amount of available "good energy" (work W) by subtracting the "bad energy" (heat Q) from the total energy ∆E = E[U]. The crucial physical intuition is that we have uncertainty about some aspects of the objects that make up the heat energy; for example, we do not know the exact trajectories of all gas particles at inverse temperature β. This uncertainty means that we do not have full control over the objects and cannot extract all the energy as work [12]. Economically speaking, the physical concept of work, and therefore also the difference in free energy, measures the certainty equivalent of a gain (or loss) that is contaminated by uncertainty. In general, we can therefore use the free energy difference to ascribe a certainty equivalent value to choice situations of the form (7). As can be seen from Corollary 3, this value is given by the log partition function, i.e. the logarithm of the normalization constant Z. For different values of β, the certainty equivalent takes the following limits:

\[ \lim_{\beta \to \infty} \frac{1}{\beta} \log Z = \max_x U(x), \]
\[ \lim_{\beta \to 0} \frac{1}{\beta} \log Z = \sum_x p_0(x) U(x), \]
\[ \lim_{\beta \to -\infty} \frac{1}{\beta} \log Z = \min_x U(x). \]
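The behavior of the certainty equivalent (1/β) log Z across β can be checked with a short numerical sketch. It assumes the normalization Z = Σ p0(x) exp(βU(x)); the function name `certainty_equivalent` and the example prior and gains are hypothetical, and the value at β = 0 is defined by its limit, the expected gain.

```python
import math

def certainty_equivalent(p0, U, beta):
    """(1/beta) * log sum_x p0(x) exp(beta * U(x)); the beta = 0 case uses the limit E[U]."""
    if beta == 0.0:
        return sum(p * u for p, u in zip(p0, U))
    terms = [math.log(p) + beta * u for p, u in zip(p0, U)]
    shift = max(terms)                       # log-sum-exp trick for numerical stability
    log_Z = shift + math.log(sum(math.exp(t - shift) for t in terms))
    return log_Z / beta

# Hypothetical prior and gains over three outcomes.
p0 = [0.5, 0.3, 0.2]
U = [1.0, 3.0, 2.0]

for beta in (-200.0, -1.0, 0.0, 1.0, 200.0):
    print(f"beta = {beta:7.1f}  certainty equivalent = {certainty_equivalent(p0, U, beta):.4f}")
```

For this example the value moves from near min U at strongly negative β, through the expected gain at β = 0, to near max U at strongly positive β, in line with the three limits stated above.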
Again, the case β → ∞ corresponds to the perfectly rational actor (or the extremely optimistic observer), the case β → −∞ corresponds to the perfectly "anti-rational" actor (or the extremely pessimistic observer) and the case β → 0 corresponds to the actor that has no resources (or the risk-neutral observer) such that the best one can expect is the expected gain or loss.

Corollary 3 has two interpretations in statistical physics, either as an instantiation of a minimum energy principle or as a maximum entropy principle [22]. Accordingly, (7) can either be seen as the distribution that maximizes the entropy given a constraint on the expectation value of U or as the distribution that minimizes the expectation of −U given a constraint on the entropy of p. In the context of observer modeling, the first interpretation provides a principle for estimation and the second interpretation provides a principle for bounded rational decision-making in the case of acting, which is a maximum expected gain principle with a relative entropy constraint that bounds the information-processing capacity of the decision-maker. In the relative entropy we recognize the terms $-\frac{1}{\beta} \log p(x)$ and $-\frac{1}{\beta} \log p_0(x)$ as the transformation costs ρ(x) and $\rho_0(x)$ from Theorem 1, such that we can express the negative free energy difference −∆F as

\[ -\Delta F = \mathbf{E}[U] - \mathbf{E}[R], \]

where $U(x) = \phi_0(x) - \phi(x)$ represents gains (or losses) and $R(x) = \rho_0(x) - \rho(x) = \frac{1}{\beta} \log \frac{p(x)}{p_0(x)}$ represents the extra resource costs required to achieve the gain (or loss) U. We can therefore see how the variational principle of Corollary 3 formalizes a trade-off between expected gains (or losses) and information processing costs.

3 Summary of Main Concepts

In decision theory, choices between alternatives are usually formalized as choices between lotteries, where a lottery is formalized as a set X of possible outcomes, a probability distribution $p_0$ over X, and a real-valued function U over X called the utility function.
In particular, expected utility theory predicts that a decision-maker always chooses the lottery with the higher expected utility $\mathbf{E}[U] = \sum_x p_0(x) U(x)$. Here we introduce the notion of a bounded lottery as a lottery that is additionally characterized by a resource parameter β ∈ R that captures the resource constraints of the decision-maker. We have derived a thermodynamic framework for bounded lotteries from simple axioms that measure information processing cost (see also [19]).

The most important difference of bounded decision-making compared to perfectly rational decision-making is that the bounded decision-maker will not be able to choose infallibly the best lottery. In fact, the resource constraints lead to stochastic choice behavior which can be characterized by a probability distribution. The decision process then transforms an initial choice probability $p_0$ into a final choice probability p by taking into account the utility gains (or losses) and the transformation costs. This transformation process can be formalized as

\[ p(x) = \frac{1}{Z}\, p_0(x)\, e^{\beta U(x)}, \qquad \text{where } Z = \sum_{x'} p_0(x')\, e^{\beta U(x')}. \tag{9} \]

Accordingly, the choice pattern of the decision-maker is predicted by the probability p. Crucially, the probability p extremizes the variational principle

\[ \max_p \left\{ \sum_x p(x) U(x) - \frac{1}{\beta} \sum_x p(x) \log \frac{p(x)}{p_0(x)} \right\}. \tag{10} \]

These two terms can be interpreted as determinants of bounded rational decision-making in that they formalize a trade-off between an expected utility gain (first term) and the information processing cost of transforming $p_0$ into p (second term). The certainty equivalent value of a bounded lottery can be obtained by inserting the choice probability p from (9) into (10), yielding

\[ V = \frac{1}{\beta} \log \left( \sum_x p_0(x)\, e^{\beta U(x)} \right), \tag{11} \]

which corresponds to the log partition sum. For different values of β, the certainty equivalent takes the following limits:

\[ \lim_{\beta \to \infty} \frac{1}{\beta} \log Z = \max_x U(x), \]
\[ \lim_{\beta \to 0} \frac{1}{\beta} \log Z = \sum_x p_0(x) U(x), \]
\[ \lim_{\beta \to -\infty} \frac{1}{\beta} \log Z = \min_x U(x). \]

The case β → ∞ corresponds to the perfectly rational actor (or the extremely optimistic observer), the case β → −∞ corresponds to the perfectly "anti-rational" actor (or the extremely pessimistic observer) and the case β → 0 corresponds to the actor that has no resources (or the risk-neutral observer) such that the best one can expect is the expected gain or loss. For illustration see Figure 3.

Figure 3: a) Negative free energy difference −∆F versus the resource parameter β. The resource parameter allows modeling decision-makers with bounded resources, either when generating their own actions (β > 0) or when anticipating their environment (β < 0). The negative free energy difference corresponds to the certainty equivalent. b) Distribution over the outcomes depending on the resource parameter β. For large positive β the distribution concentrates on the outcome with maximum gain $U_{\max}$. For large negative β the distribution concentrates on the worst outcome with gain $U_{\min}$. For β = 0 the outcomes follow the given distribution $p_0$.

4 Bounded Rationality and Satisficing

Herbert Simon [23] proposed in the 1950s that bounded rational decision-makers do not commit to an unlimited optimization by searching for the absolute best option. Rather, they follow a strategy of satisficing, i.e. they settle for an option that is good enough in some sense. Since then, it has been debated whether satisficing decision-makers can be described as bounded rational decision-makers that act optimally under resource constraints or whether optimization is the wrong concept altogether [11]. If decision-makers did indeed explicitly attempt to solve such a constrained optimization problem, this would lead to an infinite regress and the paradoxical situation that a bounded rational decision-maker would have to solve a more complex (i.e. constrained) optimization problem than a perfectly rational decision-maker.
To resolve this paradox, the bounded rational decision maker must not be able to reason about his constraints. He just searches randomly for the best option, until his resources run out. An observer will then be able to assign a probability distribution to the decision-maker's choices and investigate how this probability distribution changes depending on the available resources. Consider, for example, an anytime algorithm that will compute a solution more and more precisely the more time it has at its disposal. As one does not want to wait forever for an answer, the anytime computation will be interrupted at some point where one assumes that the answer is going to be good enough. This concept of satisficing can be used to interpret Equation (7), which describes the choice rule of a bounded rational decision-maker.

Consider the problem of picking the largest number in a sequence $U_1, U_2, \ldots$ of i.i.d. data, where each $U_i \in \mathcal{U}$ is drawn from a source with probability distribution µ. This could be, for instance, an urn with numbered balls that we draw with replacement while always keeping track of the largest number seen so far. After m draws the largest number will be given by $v := \max\{U_1, U_2, \ldots, U_m\}$. Naturally, the larger the number of draws, the higher the chances of observing a large number. The cumulative distribution function of the largest number v after m draws is given by

\[ F_m(v) = F_0(v)^m, \tag{12} \]

where $F_0$ is the cumulative distribution function of µ [24]. If we only cared about finding the maximum with absolute certainty then we would need to draw an infinite number of times. However, a bounded rational decision-maker would stop after a certain time, when he feels that the benefit of further exploration does not justify the effort of further drawings. Thus, the number of draws in this example can be regarded as a resource and the numbers on the balls can be regarded as utilities.
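The urn example can be made concrete with a small sketch that builds the distribution of the largest of m draws from the relation F_m(v) = F_0(v)^m and reports the expected maximum for increasing m. The source distribution used here, a Poisson pmf with parameter 5 restricted to the values 1 to 10 and renormalized, is an assumption chosen for illustration, as are the helper names `truncated_poisson`, `max_pmf`, and `expected_max`.

```python
import math

def truncated_poisson(lam, values):
    """Poisson(lam) weights restricted to `values` and renormalized (illustrative choice)."""
    weights = [lam**v / math.factorial(v) for v in values]
    total = sum(weights)
    return [w / total for w in weights]

def max_pmf(mu, m):
    """pmf of the largest of m i.i.d. draws, obtained from F_m(v) = F_0(v)**m.

    `mu` lists the outcome probabilities in ascending order of outcome value.
    """
    pmf, cdf_prev = [], 0.0
    for p in mu:
        cdf = cdf_prev + p
        pmf.append(cdf**m - cdf_prev**m)     # F_m at this value minus F_m at the previous one
        cdf_prev = cdf
    return pmf

def expected_max(values, mu, m):
    """Expected value of the largest of m draws."""
    return sum(v * q for v, q in zip(values, max_pmf(mu, m)))

values = list(range(1, 11))
mu = truncated_poisson(5.0, values)
expected = {m: expected_max(values, mu, m) for m in (1, 2, 4, 8, 16, 32)}
for m, e in expected.items():
    print(f"m = {m:2d}  E[max] = {e:.3f}")
```

In this sketch the expected maximum grows with m while each additional draw contributes less than the one before, which is the diminishing return that motivates stopping after a finite number of draws.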
The behavior of the bounded rational decision-maker is then stochastic even though he acts perfectly deterministically, in the sense that he chooses option v according to the distribution (12) given the resource constraint m.

According to (12), the more resources a decision-maker spends, the more he resembles a perfectly rational decision-maker that chooses the maximum number (Figure 4a), since the expected utility increases monotonically with the amount of resources spent (Figure 4b). Importantly, however, note that the marginal increase in the expected utility diminishes with larger effort; hence larger and larger effort pays out less and less in the end. Below we formalize this trade-off.

Figure 4: a) Distributions over the maximum for various sample sizes (M + 1), shown for M = 0, 8, 32 and 128. The distribution µ over the ten values v in $\mathcal{U} = \{1, 2, 3, \ldots, 10\}$ follows a truncated Poisson distribution with parameter λ = 5, as can be seen in the plot for M = 0. The distribution approaches a delta function over v = 10 for increasing values of M. b) The expected maximum E[v] versus sample size (M + 1). The incremental gain of the expected maximum is marginally decreasing as the sample size increases (red). If the sampling process is associated with a cost, e.g. c = 0.02 per sample in the figure, then the penalized expected maximum E[v] − M · c (black) reaches a unique maximum for a finite sample size; the optimal sample size is M = 35 in the figure.

Here we show that the boundedness parameter β plays an analogous role to the number of draws m. In the limit of a continuous cumulative function $F_0$, the density after m draws is given by $p_m(v) = \frac{d}{dv} F_0(v)^m = m\, F_0(v)^{m-1} \mu(v)$. We can now compute the log odds for two random outcomes v and v′, which results in

\[ \log \frac{p_m(v)}{p_m(v')} = (m - 1) \log \frac{F_0(v)}{F_0(v')} + \log \frac{\mu(v)}{\mu(v')}, \]

where $F_0(v)$ is again the cumulative of µ.
If we require the probabilities $p_m(v)$ to be representable by a distribution of the exponential family such that

\[ p_m(v) = \frac{\mu(v) \exp(\alpha U(v))}{\int dv'\, \mu(v') \exp(\alpha U(v'))}, \]

we see that the log odds have the following relation

\[ \log \frac{p_m(v)}{p_m(v')} = \alpha \big( U(v) - U(v') \big) + \log \frac{\mu(v)}{\mu(v')}. \]

Comparing the two expressions for the log odds, α takes the place of m − 1 and U(v) the place of $\log F_0(v)$, so that α, like m, plays the role of the number of samples or computations. In general, the following theorem can be shown to hold.

Theorem 4. Let X be a finite set. Let Q and M be strictly positive probability distributions over X. Let α be a positive integer. Define $M_\alpha$ as the probability distribution over the maximum of α samples from M. Then, there are strictly positive constants δ and ξ depending only on M such that for all α,

\[ \left| \frac{Q(x)\, e^{\alpha U(x)}}{\sum_{x'} Q(x')\, e^{\alpha U(x')}} - M_\alpha(x) \right| \le e^{-(\alpha - \xi)\delta}. \]

Consequently, one can interpret the inverse temperature as a resource parameter that determines how many samples are drawn to estimate the maximum. Note that the distribution M is arbitrary as long as it has the same support as Q. This interpretation can be extended to a negative α, by noting that αU(x) = (−α)(−U(x)), i.e. instead of the maximum we take the minimum of −α samples.

5 Sequential Decision-Making

In the case of sequential decision-making the assumption of uniform temperatures has to be relaxed; the proofs of the following theorems can be found in [25]. In general, we can then dedicate different amounts of computational resources to each node of a decision tree. However, this requires a translation between a tree with a single temperature and a tree with different temperatures. This translation can be achieved using the following theorem.

Theorem 5. Let P be the equilibrium distribution for a given inverse temperature α, utility function U and reference distribution Q. If the inverse temperature changes to β while keeping P and Q fixed, then the utility function changes to

\[ V(x) = U(x) - \left( \frac{1}{\alpha} - \frac{1}{\beta} \right) \log \frac{P(x)}{Q(x)}. \]
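Theorem 5 can be checked numerically on a small example. The sketch below computes the equilibrium distribution P for an inverse temperature α, translates the utility U into V using the formula of the theorem, and then recomputes the equilibrium distribution at inverse temperature β with V. The helper names `equilibrium` and `translate_utility`, the reference distribution, the utilities, and the two inverse temperatures are all hypothetical values chosen for illustration.

```python
import math

def equilibrium(Q, U, beta):
    """Equilibrium distribution proportional to Q(x) * exp(beta * U(x))."""
    logits = [math.log(q) + beta * u for q, u in zip(Q, U)]
    shift = max(logits)                      # stabilize the exponentials
    weights = [math.exp(l - shift) for l in logits]
    Z = sum(weights)
    return [w / Z for w in weights]

def translate_utility(P, Q, U, alpha, beta):
    """V(x) = U(x) - (1/alpha - 1/beta) * log(P(x) / Q(x)), as in Theorem 5."""
    c = 1.0 / alpha - 1.0 / beta
    return [u - c * math.log(p / q) for p, q, u in zip(P, Q, U)]

# Hypothetical reference distribution, utilities and inverse temperatures.
Q = [0.5, 0.3, 0.2]
U = [1.0, 3.0, 2.0]
alpha, beta = 2.0, 0.5

P = equilibrium(Q, U, alpha)                 # equilibrium at inverse temperature alpha
V = translate_utility(P, Q, U, alpha, beta)  # translated utility for inverse temperature beta
P_check = equilibrium(Q, V, beta)            # should reproduce P

print([round(p, 6) for p in P])
print([round(p, 6) for p in P_check])
```

The two printed distributions agree because the translated utility V equals (α/β)U plus a constant, so that βV and αU differ only by a term that cancels in the normalization.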
If we now define the reward as the change in utility of two subsequent nodes, then the rewards of the resulting decision tree are given by R(xt|x 0) add value to the expected utility in the face of variability. Risk-sensitivity biases the beliefs about the environment optimistically (collaborative environment) or pessimistically (adversarial environment). Alternatively, one could regard a collaborative environment also as a bounded rational controller that can choose its own observation, that is, the environment behaves like an extension of the agent with partial control. Importantly, the stress function is typically assumed in risk-sensitive control schemes in the literature, whereas here it falls out naturally; see [29] for more details.

4. Robust control. When assuming β(x