intellecton/papers/project_paper_4_fbt/references/Ortega2013.txt

Thermodynamics as a theory of decision-making
with information processing costs

Pedro A. Ortega and Daniel A. Braun

July 31, 2012

Abstract

Perfectly rational decision-makers maximize expected utility, but cru-
cially ignore the resource costs incurred when determining optimal actions.
Here we propose an information-theoretic formalization of bounded ratio-
nal decision-making where decision-makers trade off expected utility and
information processing costs. Such bounded rational decision-makers can
be thought of as thermodynamic machines that undergo physical state
changes when they compute. Their behavior is governed by a free en-
ergy functional that trades off changes in internal energy—as a proxy for
utility—and entropic changes representing computational costs induced
by changing states. As a result, the bounded rational decision-making
problem can be rephrased in terms of well-known concepts from statis-
tical physics.
In the limit when computational costs are ignored, the
maximum expected utility principle is recovered. We discuss the relation
to satisficing decision-making procedures as well as links to existing theo-
retical frameworks and human decision-making experiments that describe
deviations from expected utility theory. Since most of the mathematical
machinery can be borrowed from statistical physics, the main contribution
is to axiomatically derive and interpret the thermodynamic free energy as
a model of bounded rational decision-making.

1
Introduction

In everyday life decision-makers often have to make fast and frugal choices
[1, 2]. Consider, for example, an antelope that quickly has to choose a direction
of flight when faced with a predator. By the time an antelope had considered
all possible flight paths to determine the optimal one, it would most probably
be already eaten. In general, decision-makers seem to trade off the expected
desirability of the consequences of an action against the effort and resources
(time, money, food, computational effort, knowledge, opportunity costs, etc.)
required for searching the optimum [3, 4].
Classic theories of decision making generally ignore information-processing
costs by assuming that decision makers always pick the option with maximum

1


return—irrespective of the effort or the resources it might take to find or com-
pute the optimal action [5, 6, 7]. Such decision-makers are described as perfectly
rational. However, being perfectly rational seems to contradict our intuition of
real-world decision-making, where information processing constraints play an
important role [1]. This has led to an abundant literature on bounded rational-
ity [8, 9, 10, 11]. Unlike perfectly rational decision makers, bounded rational
decision-makers are subject to information processing constraints, that is they
may have limited time and speed to process a limited amount of information.

1.1
Thermodynamic Intuition

a)
b)

pV

(1 − p)V

A

B

Figure 1: The Molecule-In-A-Box Device.
(a) Initially, the molecule moves
freely within a space of volume V delimited by two pistons. The compartments
A and B correspond to the two logical states of the device. (b) Then, the lower
piston pushes the molecule into part A having volume V ′ = pV .

Here we follow a thermodynamic argument [12] that allows measuring re-
source (or information) costs in physical systems in units of energy. The gen-
erality of the argument relies on the fact that ultimately any real agent has to
be incarnated in a physical system, as any process of information processing
must always be accompanied by a pertinent physical process [13]. In the fol-
lowing we conceive of information processing as changes in information states
(i.e. ultimately changes of probability distributions), which consequently im-
plies changes in physical states, such as flipping gates in a transistor, changing
voltage on a microchip, or even changing location of a gas particle. Such state
changes in physical systems are not for free, that is the do not happen sponta-
neously. Consequently, if we want to control a physical system into a desirable
state we also have to take into consideration that changing from the current
state to the desirable state incurs a cost.
According to Landauer’s principle, one can postulate a formal correspon-
dence between one unit of information and one unit of energy [14, 15, 16].
Consider representing one bit of information using one of the following logical
devices: a molecule that can be located either on the top or the bottom part of

2


a box; a coin whose face-up side can be either head or tail; a door that can be
either open or closed; a train that can be orientated facing either north or south;
and so forth. Assume that all these devices are initialized in an undetermined
logical state, where the first state has probability p and the second probability
1 − p. Now, imagine you want to set these devices to their first logical state.
In the case of the molecule in a box, this means the following. Initially, the
molecule is uniformly moving around within a space confined by two pistons as
depicted in Figure 1a. Assuming that the initial volume is V , the molecule has
to be pushed by the lower piston into the upper part of the box having volume
V ′ = pV (Figure 1b). From information theory, we know that the number of
bits that we fix by this operation is given by − log p.
To make things concrete, we assume that the device has diathermal walls
and is immersed in a heat bath at constant temperature T . Since the walls are
diathermal, the temperature inside of the box is maintained at the temperature
of the heat bath. We model the particle as an ideal gas. When an ideal gas
is compressed under isothermal conditions from an initial volume V to a final
volume V ′, then the work is calculated as

W = −
� V ′

V

NkT

V
dV = NkT ln V

V ′ ,
(1)

where N ≥ 0 is the amount of substance and k > 0 is the Boltzmann constant.
The minus sign is just a convention to denote work done by the piston rather
than by the gas. If we assume N = 1 and make use of the fact that V ′ = pV
we get

W = kT ln V

pV = −kT ln p = − kT

log e log p = −γmol log p,

where the constant γmol :=
RT
log e > 0 can be interpreted as the conversion factor
between one unit of information and one unit of energy for the molecule-in-a-box
device.
How do we compute the information and work for the case of the coin, door
and train devices? The important observation is that we can model these cases
as if they were like molecule-in-a-box devices, with the difference that their
conversion factors between units of information and units of work are different.
Hence, the number of bits fixed while these devices are set to the first state is
given by − log p, i.e. exactly as in the case of the molecule. However, the work
is given by
−γcoin log p,
−γdoor log p,
and
− γtrain log p

respectively, where γcoin, γdoor and γtrain are the associated conversion factors
between units of information. Obviously, γmol ≤ γcoin ≤ γdoor ≤ γtrain. The
point is that changes in knowledge states are costly and that these costs are
proportional to the information. In the next section, we derive a general ex-
pression of information costs in physical systems that make decisions.

3


2
Information-Theoretic Foundations

2.1
Resource Costs

We model any observable sequential process, such as a sequence of interactions
or a sequence of computation steps, as a filtration on a measure space.
To
simplify our exposition, we consider only finite measure spaces.
Let (Ω, Σ)
denote a measurable space, where Ω denotes the sample space and where Σ is
a σ-algebra on Ω. Let p be a conditional probability measure on (Ω, Σ), such
that for any two events A, B ∈ Σ, p(A|B) denotes the conditional probability of
the A given B, where the condition B plays the role of the current information
state of the process. The sequential realization of a process is modelled as a
sequence of conditions A1, A2, . . . , AT on the sample space Ω, where each new
condition At refines the current information state �
τ≤t Aτ by excluding the
complement A∁
t .
We further assume that a transformation of an information state from B to
(A ∩ B) entails a cost ρ(A|B) that could be measured in dollars, time or any
arbitrary scale of effort. Moreover, we assume that this transformation cost is
decomposable; that is, if we undergo a knowledge change from C to (A∩B ∩C),
then we should pay the same cost as undergoing a change first from C to (B∩C)
and then from (B ∩ C) to (A ∩ B ∩ C). Finally, the quintessential information-
theoretic postulate is that conditional probabilities impose a monotonic order
over transformation costs1. We can sum up our postulates as follows:

Definition 1 (Axioms of Transformation Costs). Let (Ω, Σ) be a measurable
space and let p : (Σ × Σ) → [0, 1] be a conditional probability measure over
Σ (i.e. for any A ∈ Σ, p(·|A) is a probability measure over A). A function
ρ : (Σ × Σ) → R+ is a transformation cost function for p iff it has the following
three properties for all events A, B, C, D ∈ Σ:

A1. real-valued:
∃f,
ρ(A|B) = f
�
p(A|B)
�
∈ R,
A2. additive:
ρ(A ∩ B|C) = ρ(A|C) + ρ(B|A ∩ C),

A3. monotonic:
[ρ(A|B) > ρ(C|D)]
⇔
[p(A|B) ≶ p(C|D)].

These three properties enforce a strict correspondence between probabilities
and transformation costs [18, 19].

Theorem 1 (Transformation Costs ↔ Probabilities). If f is such that ρ(A|B) =
f(p(A|B)) for every choice of the probability space (Ω, Σ, p), then f is of the form

f(·) = − 1

β log(·),

where β is a real parameter.

1This intuition is central for optimal coding theory where short codewords are assigned
to frequent events and long codewords are assigned to rare events [17]. Therefore, we could
regard the codeword length as a valuable resource that we have to bet on events with different
probabilities.

4


That is, the transformation cost ρ(A|B) is proportional to the information
content − log p(A|B), where the parameter β plays the role of the conversion
factor.
The logarithmic mapping between probabilities and “costs” is well-
known in information theory, and there are many possible ways to derive it
[20, 21]. The important observation is that our derivation stems purely from
postulates regarding transformation costs.
According to Definition 1, transformation costs measure the relative cost of
an event relative to a reference event. However, we can also introduce an absolute
cost measure to single events such that transformation costs are obtained as
differences.

Definition 2 (Potential). Let ρ be a transformation cost function. A set func-
tion φ : Σ → R is called a cost potential for ρ iff for all A, B ∈ Σ,

φ(Ω) := φ0
φ(A ∩ B) := φ(B) + ρ(A|B)
∀A, B ∈ Σ,

where φ0 is an arbitrary real value.

One can easily verify that this potential is well defined for all events, and
that ρ(A|B) = φ(A ∩ B) − φ(B). It captures the intuition that starting out
from the high-probability event B with potential φ(B) one has to pay the cost
ρ(A|B) to arrive at the low-probability event A ∩ B with potential φ(A ∩ B).
In the following, consider a reference set S ∈ Σ having a measurable parti-
tion X. Cost potentials have an important recursive structure: the cost potential
of an event is uniquely determined by the potential of its constituent events.
If X is a measurable partition of a reference event S ∈ Σ, then

φ(S) = − 1

β log
�

x∈X
e−βφ(x).
(2)

Furthermore, the probability of a member x ∈ X of the partition relative to S
can be expressed as a Gibbs measure:

p(x|S) = e−βφ(x)

e−βφ(S) =
e−βφ(x)

�
x∈X e−βφ(x) .
(3)

In statistical physics it is well-known that the Gibbs measure satisfies a varia-
tional principle in the free energy, which is defined as

Fβ[q] :=
�

x∈X
q(x)φ(x) + 1

β

�

x∈X
q(x) log q(x).
(4)

More specifically, it is well known that for any probability measure q over the
partition X of S,

F[q] ≥ F[p] = − 1

β log φ(S),
(5)

where the lower bound is attained by the Gibbs measure p(x) ∝ e−βφ(x). Equa-
tions (2) to (5) constitute fundamental results that will be generalized and
interpreted in the next section.

5


2.2
Gains and Losses

Equipped with the results from the preceding section, we can now proceed to
model a bounded rational decision maker. Because transformation costs matter,
we model a decision as a transformation of a prior behavior into a final behavior,
where we represent the direction of change as a utility criterion.
The Gibbs measure in (3) allows us describing a probability measure p over
a partition X in terms of a cost potential φ over X. In particular, we see that a
decision-maker’s a priori behavior or belief described by p0(x) and φ0(x) changes
to p(x) and φ(x) if he is exposed to the gain (or loss) U(x), such that

φ(x) = φ0(x) − U(x)
(6)

and
p(x) ∝ e−βφ0(x)+βU(x) ∝ p0(x)eβU(x)
(7)

as illustrated in Figure 1. The function U represents either gains or losses and
not absolute levels of costs, because it expresses a difference in the potential
U(x) = φ0(x)−φ(x). The equilibrium distribution (7) that arises in a change can
also be characterized in terms of a variational principle, in a manner analogous
to (5).

Theorem 2 (Negative Free Energy Difference). Let p0(x) and p(x) be the Gibbs
measures with potentials φ0(x) and φ(x) and resource parameter β. Let F0 and
F be the free energies minimized by p0 and p respectively. Then, the negative
free energy difference −∆F = F0 − F is

−∆F =
�

x∈X
p(x)U(x) − 1

β

�

x∈X
p(x) log p(x)

p0(x),
(8)

where U(x) = φ0(x) − φ(x).

Since the difference in the negative free energy −∆F = F − F0 has the same
dependency on p as the free energy F, we can use −∆F directly as a variational
principle in p.

Corollary 3 (Variational Principle). The negative free energy difference pro-
vides a variational principle for the equilibrium distribution, i.e.

−∆F[q] :=
�

x∈X
q(x)U(x) − 1

β

�

x∈X
q(x) log q(x)

p0(x)

is maximized by

p(x) = 1

Z p0(x)eβU(x),
where
Z :=
�

x∈X
eβU(x).

Furthermore,
∆F[q] ≤ ∆F[p] = 1

β log Z.

6


φ0

−U

φ = φ0 − U

Initial
Final

low

high

Probability

Figure 2: Representing a decision maker as a thermodynamic system, the be-
havior of the decision-maker exposed to a gain U can be expressed as a change
of his initial cost potential φ0 to a final cost potential φ, where φ = φ0 − U.
The choice or belief probabilities of the decision-maker change according to (7)
from p0 to p.

2.3
Choice & Belief Probabilities

The distribution (7) can be interpreted both as an action or observation prob-
ability in the context of bounded rational decision-making. In the case of ac-
tions, p0 represents the a priori choice probability of the agent which is refined
to the choice probability p when evaluating the imposed gain (or loss) U. The
associated change in probability depends on the resource parameter β and cor-
responds to the computation that is necessary to evaluate the gains (or losses).
In the case of observations, p0 represents the a priori belief of the agent given
by a probabilistic model, which is then distorted due to the presence of possible
gains (or losses) that are evaluated by the holder of the belief. This way, model
uncertainty and risk-aversion can be parameterized by β.
For different values of β the distribution (7) has the following limits

lim
β→∞ p(x)
=
δ(x − x∗),
x∗ = max
x
U(x)

lim
β→0 p(x)
=
p0(x)

lim
β→−∞ p(x)
=
δ(x − x∗),
x∗ = min
x U(x).

In the case of actions the three limits imply the following: The limit β → ∞
corresponds to the perfectly rational actor that infallibly selects the action that
maximizes gain (or minimizes loss −U(x). The limit β → 0 is an actor without
resources that simply selects his action according to his prior. The limit β →
−∞ corresponds to an actor that is perfectly “anti-rational” and always selects
the action with the worst outcome. In the case of observations the three limits
correspond to an extremely optimistic observer (β → ∞) who believes only in
the best possible outcome, an extremely pessimistic observer (β → −∞) who
anticipates only the worst, and a risk-neutral Bayesian observer (β → 0) who
simply relies on the probabilistic model p0.

7


2.4
The Certainty Equivalent

In statistical physics [22], the free energy difference

∆A = ∆E − Q = W

measures the amount of available “good energy” (work W) by subtracting the
“bad energy” (heat Q) from the total energy ∆E = E[U]. The crucial physical
intuition is that we have uncertainty about some aspects of the objects that
make up the heat energy, for example we do not know the exact trajectories
of all gas particles at temperature β. This uncertainty means that we do not
have full control over the objects and cannot extract all the energy as work
[12]. Economically speaking, the physical concept of work, and therefore also
the difference in free energy, measures the certainty equivalent of a gain (or
loss) that is contaminated by uncertainty. In general, we can therefore use the
free energy difference to ascribe a certainty equivalent value to choice situations
of the form (7). As can be seen from Corollary 3, this value is given by the
log partition function, i.e. the logarithm of the normalization constant Z. For
different values of β, the certainty equivalent takes the following limits

lim
β→∞
1
β log Z
=
max
x
U(x)

lim
β→0
1
β log Z
=
�

x
p0(x)U(x)

lim
β→−∞
1
β log Z
=
min
x U(x).

Again, the case β → ∞ corresponds to the perfectly rational actor (or the
extremely optimistic observer), the case β → −∞ corresponds to the perfectly
“anti-rational” actor (or the extremely pessimistic observer) and the case β → 0
corresponds to the actor that has no resources (or the risk-neutral observer) such
that the best one can expect is the expected gain or loss.
Corollary 3 has two interpretations in statistical physics, either as an in-
stantiation of a minimum energy principle or as a maximum entropy principle
[22]. Accordingly, (7) can either be seen as the distribution that maximizes the
entropy given a constraint on the expectation value of U or as the distribution
that minimizes the expectation of −U given a constraint on the entropy of p.
In the context of observer modeling, the first interpretation provides a principle
for estimation and the second interpretation provides a principle for bounded
rational decision-making in the case of acting, which is a maximum expected
gain principle with a relative entropy constraint that bounds the information-
processing capacity of the decision-maker. In the relative entropy we recognize
the term 1

β log p(x) as our transformation costs ρ from Theorem 1 such that we
can express the negative free energy difference −∆F as

−∆F = E[U] − E[R],

where U(x) = φ0(x)− φ(x) represents gains (or losses) and R(x) = ρ(x)− ρ0(x)
represents the extra resource costs required to achieve the gain (or loss) U.

8


We can therefore see how the variational principle of Corollary 3 formalizes a
trade-off between expected gains (or losses) and information processing costs.

3
Summary of Main Concepts

In decision theory, choices between alternatives are usually formalized as choices
between lotteries, where a lottery is formalized as a set X of possible out-
comes, a probability distribution p0 over X, and a real-valued function U over
X called the utility function. In particular expected utility theory predicts that
a decision-maker always chooses the lottery with the higher expected utility
E[U] = �
x p0(x)U(x). Here we introduce the notion of a bounded lottery as a
lottery that is additionally characterized by a resource parameter β ∈ R that
captures the resource constraints of the decision-maker.
We have derived a thermodynamic framework for bounded lotteries from
simple axioms that measure information processing cost—see also [19].
The
most important difference of bounded decision-making compared to perfectly
rational decision-making is that the bounded decision-maker will not be able
to choose infallibly the best lottery. In fact, the resource constraints lead to
stochastic choice behavior which can be characterized by a probability distribu-
tion. The decision process then transforms an initial choice probability p0 into
a final choice probability p by taking into account the utility gains (or losses)
and the transformation costs. This transformation process can be formalized as

p(x) = 1

Z p0(x)eβU(x),
where
Z =
�

x′
p0(x′)eβU(x′).
(9)

Accordingly, the choice pattern of the decision-maker is predicted by the prob-
ability p. Crucially, the probability p extremizes the variational principle

max
p

� �

x
p(x)U(x) − 1

β

�

x
p(x) log p(x)

p0(x)

�
.
(10)

These two terms can be interpreted as determinants of bounded rational decision-
making in that they formalize a trade-off between an expected utility gain (first
term) and the information processing cost of transforming p0 into p (second
term). The certainty equivalent value of a bounded lottery can be obtained by
inserting the choice probability p from (9) into (10), yielding

V = 1

β log
��

x
p0(x)eβU(x)
�
,
(11)

which corresponds to the log partition sum. For different values of β, the cer-

9


a)

b)

Umax

Umin

E[U]
β

β1

β1

β2

β2

β3

β3

Figure 3: a) Negative free energy difference ∆F versus the resource parame-
ter β. The resource parameter allows modeling decision-makers with bounded
resources, either when generating their own actions (β > 0) or when anticipat-
ing their environment (β < 0). The negative free energy difference corresponds
to the certainty equivalent. b) Distribution over the outcomes depending on
the resource parameter β. For large positive β the distribution concentrates on
the outcome with maximum gain Umax. For large negative β the distribution
concentrates on the worst outcome with gain Umin. For β = 0 the outcomes
follow the given distribution p0.

tainty equivalent takes the following limits

lim
β→∞
1
β log Z
=
max
x
U(x)

lim
β→0
1
β log Z
=
�

x
p0(x)U(x)

lim
β→−∞
1
β log Z
=
min
x U(x).

The case β → ∞ corresponds to the perfectly rational actor (or the extremely
optimistic observer), the case β → −∞ corresponds to the perfectly “anti-
rational” actor (or the extremely pessimistic observer) and the case β → 0
corresponds to the actor that has no resources (or the risk-neutral observer)
such that the best one can expect is the expected gain or loss. For illustration
see Figure 2.

10


4
Bounded Rationality and Satisficing

Herbert Simon [23] proposed in the 50s that bounded rational decision-makers
do not commit to an unlimited optimization by searching for the absolute best
option. Rather, they follow a strategy of satisficing, i.e. they settle for an option
that is good enough in some sense. Since then, it has been debated whether sat-
isficing decision-makers can be described as bounded rational decision-makers
that act optimally under resource constraints or whether optimization is the
wrong concept altogether [11]. If decision-makers did indeed explicitly attempt
to solve such a constrained optimization problem, this would lead to an infinite
regress and the paradoxical situation that a bounded rational decision-maker
would have to solve a more complex (i.e. constrained) optimization problem
than a perfectly rational decision-maker.
To resolve this paradox, the bounded rational decision maker must not be
able to reason about his constraints. He just searches randomly for the best
option, until his resources run out. An observer will then be able to assign a
probability distribution to the decision-maker’s choices and investigate how this
probability distribution changes depending on the available resources. Consider,
for example, an anytime algorithm that will compute a solution more and more
precisely the more time it has at its disposal. As one does not want to wait
forever for an answer, the anytime computation will be interrupted at some
point where one assumes that the answer is going to be good enough. This
concept of satisficing can be used to interpret Equation 7 which describes the
choice rule of a bounded rational decision-maker.
Consider the problem of picking the largest number in a sequence U0, U1, U2, . . .
of i.i.d. data, where each Ui ∈ U is drawn from a source with probability dis-
tribution µ. This could be, for instance, an urn with numbered balls that we
draw with replacement and we always keep track of the largest number seen so
far. After m draws the largest number will be given by

v := max{U1, U2, . . . , Um}.

Naturally, the larger the number of draws, the higher the chances of observing a
large number. The cumulative distribution function of choosing v after m draws
is given by
Fm(v) = F0(v)m,
(12)

where F0 is the cumulative distribution function of µ [24]. If we only cared about
finding the maximum with absolute certainty then we would need to draw an
infinite amount of times. However, a bounded rational decision-maker would
stop after a certain time, when he feels that the benefit of further exploration
does not justify the effort of further drawings. Thus, the number of draws in
this example can be regarded as a resource and the numbers on the balls can
be regarded as utilities. The behavior of the bounded rational decision-maker
is then stochastic even though he acts perfectly deterministically, in the sense
that he chooses option v with probability (12) given the resource constraint
m. According to (12), the more resources a decision-maker spends, the more he

11


a)
b)

M = 0
M = 8

M = 32
M = 128

Umax

0
M

E[v]

E[v] − M · c

Figure 4: a) Distributions over the maximum for various sample sizes (M +
1). The distribution µ over the ten values v in U = {1, 2, 3, . . ., 10} follows a
truncated Poisson distribution with parameter λ = 5, as can be seen in the
plot for M = 0. The distribution approaches a delta function over v = 10 for
increasing values of M. b) The expected maximum v versus sample size (M +1).
The incremental gain of the expected maximum is marginally decreasing as the
sample size increases (red). If the sampling process is associated with a cost—
e.g. c = 0.02 per sample in the figure—, then the penalized expected maximum
(black) reaches a unique maximum for a finite sample size—the optimal sample
size is M = 35 in the figure.

resembles a perfectly rational decision-maker that chooses the maximum number
(Figure 1a), since the expected utility increases monotonically with the amount
of resources spent (Figure 1b). Importantly, however, note that the marginal
increase in the expected utility diminishes with larger effort—hence larger and
larger effort pays out less and less in the end. Below we formalize this trade-off.
Here we show that the boundedness parameter β plays an analogous role to
the number of draws m. In the limit of a continuous cumulative function F0,
the density after m draws is given by pm(v) =
d
dvF0(v)m. We can now compute
the log odds for two random outcomes v and v′, which results in

log pm(v)

pm(v′) = (m − 1) log F0(v)

F0(v′) + log µ(v)

µ(v′),

where F0(v) is again the cumulative of µ. If we require the probabilities pm(v) to
be representable by a distribution of the exponential family such that pm(v) =
µ(v) exp(αU(v))
�
dv′µ(v′) exp(αU(v′)), we see that the log odds have the following relation

log pm(v)

pm(v′) = α (U(v) − U(v′)) + log µ(v)

µ(v′).

We see that α and m play the role of the number of samples or computations.
In general, the following theorem can be shown to hold.

12


Theorem 4. Let X be a finite set. Let Q and M be strictly positive probability
distributions over X. Let α be a positive integer. Define Mα as the probability
distribution over the maximum of α samples from M. Then, there are strictly
positive constants δ and ξ depending only on M such that for all α,
����
Q(x)eαU(x)

�

x′ Q(x′)eαU(x′) − Mα(x)
���� ≤ e−(α−ξ)δ.

Consequently, one can interpret the inverse temperature as a resource param-
eter that determines how many samples are drawn to estimate the maximum.
Note that the distribution M is arbitrary as long as it has the same support
as Q.
This interpretation can be extended to a negative α, by noting that
αU(x) = (−α)(−U(x)), i.e. instead of the maximum we take the minimum of
−α samples.

5
Sequential Decision-Making

In the case of sequential decision-making the assumption of uniform temper-
atures has to be relaxed—the proofs of the following theorems can be found
in [25]. In general, we can then dedicate different amounts of computational
resources to each node of a decision tree. However, this requires a translation
between a tree with a single temperature and to a tree with different tempera-
tures. This translation can be achieved using the following theorem

Theorem 5. Let P be the equilibrium distribution for a given inverse tem-
perature α, utility function U and reference distribution Q. If the temperature
changes to β while keeping P and Q fixed, then the utility function changes to

V (x) = U(x) −
�
1
α − 1

β
�
log P(x)

Q(x).

If we now define the reward as the change in utility of two subsequent nodes,
then the rewards of the resulting decision tree are given by

R(xt|x<t)
:=
�
V (x≤t) − V (x<t)
�

=
�
U(x≤t) − U(x<t)
�
−
�
1
α −
1

β(x<t)
�
log P(xt|x<t)

Q(xt|x<t).

This allows introducing a collection of node-specific (not necessarily time-specific)
inverse temperatures β(x<t), allowing for a greater degree of flexibility in the
representation of information costs. The next theorem states the connection
between the free energy and the general decision tree formulation.

Theorem 6. The free energy of the whole trajectory can be rewritten in terms

13


of rewards:

Fα[P] =
�

x≤T
P(x≤T )
�
U(x≤T ) − 1

α log P(x≤T )

Q(x≤T )

�

= U(ε) +
�

x≤T
P(x≤T )

T
�

t=1

�
R(xt|x<t) −
1

β(x<t) log P(xt|x<t)

Q(xt|x<t)

�
.
(13)

This translation allows applying the free energy principle to each node with
a different resource parameter β(x<t).
By writing out the sum in (13), one
realizes that this free energy has a nested structure where the latest time step
forms the innermost variational problem and all other variational problems of
the previous time steps can be solved recursively by working backwards in time.
This then leads to the following solution:

Theorem 7. The solution to the free energy in terms of rewards is given by

P(xt|x<t) =
1

Z(x<t)Q(xt|x<t) exp
�
β(x<t)
�
R(xt|x<t) +
1

β(x≤t) log Z(x≤t)
��
,

where Z(x≤T ) = 1 and where for all t < T

Z(x<t) =
�

xt
Q(xt|x<t) exp
�
β(x<t)
�
R(xt|x<t) +
1

β(x≤t) log Z(x≤t)
��
.

6
Limit Cases of Bounded Rational Control

As described in the previous section, the belief and action probabilities of an
agent in a sequential decision-making setup can be determined by recursion of
the log-partition function

V (x<t) =
1

β(x<t) log
��

xt
Q(xt|x<t) exp
�
β(x<t)
�
R(xt|x<t) + V (x≤t)
���
,

(14)
where we have introduced V (x≤t) =
1

β(x≤t) log Z(x≤t). If xt is an action variable
then Q(xt|x<t) reflects the prior policy and the agent’s rationality β(x<t) deter-
mines in how far the value R(xt|x<t)+V (x≤t) can be optimized by the agent. If
xt is an observation variable then Q(xt|x<t) reflects the agent’s prior belief and
the rationality of the environment β(x<t) indicates how much one should de-
viate from the prior belief considering the possible values R(xt|x<t) + V (x≤t).
Depending on β(x<t), different decision-making schemes can be recovered—
compare Figure 3.

1. KL control. When assuming a history-independent loss function r(xt),
Markov probabilities p0(xt|xt−1) and β(x<t) = β for all x<t, Equation (14)
simplifies to a recursion that is equivalent to z-iteration which has previ-
ously been suggested in [26, 27] to approximately solve MDPs by means

14


β1

β2

+∞

−∞

+∞
−∞

Anti-
Rational
Irrational
Rational

Risk-
Seeking

Risk-
Neutral

Risk-
Averse

Figure 5: Schematic illustration of how resource parameters can model a range of
decision-making schemes: (1)- risk-seeking, bounded rational; (2) risk-neutral,
perfectly rational; (3) risk-averse, perfectly rational; and (4) robust, perfectly
rational.

of linear algebra—see [28, 29] for details of this equivalence relation. In
[26, 27] the transition probabilities of the MDP are controlled directly
and the control costs are given by the Kullback-Leiber divergence of the
manipulated state transition probabilities with respect to a baseline distri-
bution that describes the passive dynamics. In our framework, this kind
of KL control corresponds to the special case where all random variables
are action variables and the agent has boundedness parameter β. The
stochasticity in this case, however, is not thought to arise from environ-
mental passive dynamics, but rather is a direct consequence of bounded
rational control in a (possibly) deterministic environment. The continuous
case of KL control relies on the formalism of path integrals [30, 31], but
essentially the same relation to bounded rationality can be established—
see [28] for details.

2. Optimal stochastic control. When assuming β(x<t) → ∞ for all action
variables and β(x<t) → 0 for all observation variables, we approach the
limit of the perfectly rational decision-maker in a stochastic environment.
In this limit, the log-partition function converges to the expected utility
and the decision-maker acts deterministically so as to maximize the ex-
pected utility. For action variables, recursion (14) becomes the well-known
Bellman Optimality Equation [32]—see [28, 29] for details.

3. Risk-sensitive control.
Risk-sensitive control [33] corresponds to a
decision-maker with β(x<t) → ∞ for all action variables and β(x<t) ̸= 0
for observation variables.
Risk-sensitivity in the context of continuous
KL control has been previously proposed in [34].
Mean-variance deci-

15


sion criteria used in finance can be equally derived [35]. Risk-sensitive
decision-makers do not simply maximize the expectation of the utility,
but also consider higher-order cumulants by optimizing a stress function
given by the log partition sum. A risk-averse decision-maker (β(x<t) < 0),
for example, discounts variability off the expected utility.
In contrast,
risk-seeking decision-makers (β(x<t) > 0) add value to the expected util-
ity in the face of variability. Risk-sensitivity biases the beliefs about the
environment optimistically (collaborative environment) or pessimistically
(adversarial environment). Alternatively, one could regard a collabora-
tive environment also as a bounded rational controller that can choose
its own observation—that is the environment behaves like an extension of
the agent with partial control. Importantly, the stress function is typically
assumed in risk-sensitive control schemes in the literature, whereas here
it falls out naturally—see [29] for more details.

4. Robust control.
When assuming β(x<t) → ∞ for all action vari-
ables and β(x<t) → −∞ for all observation variables, we approach the
limit of the robust decision-maker in an unknown environment. When
β(x<t) → −∞, the decision-maker makes a worst case assumption about
the environment, namely that it is strictly adversarial and perfectly ra-
tional. This leads to the well-known game-theoretic minimax problem.
Minimax problems have been used to reformulate robust control problems
that allow controllers to cope with model uncertainties [36, 37]. Robust
control problems have long been known to be related to risk-sensitive con-
trol [38, 39]. Here we derived both control types from the same variational
principle—see [29] for more details.

7
Discussion

In the proposed thermodynamic interpretation of bounded rationality, agents
with limited resources search for a maximum over a set by randomly drawing
elements from this set. This random search leads to a utility function that is
marginally decreasing when more search effort is allocated. When such agents
pay a search cost, the bounded rational optimum is to abort the search as
soon as the marginal returns are equal to the search cost. The resulting trade-
off between utility maximization and resource costs can be quantified by the
Kullback-Leibler divergence with respect to an initial policy or belief. This ini-
tial probability distribution corresponds to the initial state of a thermodynamic
system that changes when a new potential is imposed. The difference in the po-
tential corresponds to utility gains or losses in economic choice. The difference
in the free energy corresponds to physical work and the economic certainty-
equivalent. Thus, gains or losses that are associated with uncertainty are effec-
tively devalued or overvalued, depending on the sign of the resource parameter.
This way risk-sensitivity, robustness to model uncertainty and game-theoretic
minimax-strategies can arise naturally.

16


Bounded rationality.
Starting with Simon [23], bounded rationality has
been extensively studied in psychology, economics, political science, industrial
organization, computer science and artificial intelligence research—see for ex-
ample [40, 41, 42, 43, 10, 11, 44, 45]. Additionally, numerous experiments in
behavioral economics have shown that humans systematically violate perfect
rationality, that is they are bounded rational [46]. Probably the most closely
related approach to bounded rationality with respect to the present article is
quantal response equilibrium (QRE) game theory [47, 48, 49, 50]. QRE models
assume bounded rational players whose choice probabilities are given by the
Boltzmann distribution and whose rationality is determined by a temperature
parameter. Interactions of such bounded rational players can lead to game-
theoretic solutions that deviate from the Nash equilibrium. The QRE model
is a special case of the model presented here where all prior probabilities are
assumed to be uniform. These prior probabilities are crucial when defining the
certainty-equivalent that ranges from minimum to maximum via the expected
utility. As the certainty-equivalent corresponds to physical work, this also al-
lows to relate bounded rational decision-making to thermodynamic processes.
The distinction of a prior policy and a utility that is optimized to some extent
is fundamental to the notion of bounded rationality proposed in this paper and
therefore also affords a qualitative advance of the bounded rationality model in
QRE models.

Information theory in control and game theory.
As already discussed,
a number of papers have suggested the use of the relative entropy as a cost
function for control [26, 27, 51, 52]. Previously, Saridis [53] has framed optimal
and adaptive control as entropy minimization problems. Statistical physics has
also served as an inspiration to a number of other studies, for example, to an
information-theoretic approach to interactive learning [54], to use information
theory to approximate joint strategies in games with bounded rational players
[55] and to the problem of optimal search [56, 57], where the utility losses
correspond directly to search effort.
Recently, Tishby and Polani [58] have
shown how to apply information theory to understand information flow in the
action-observation cycle. The contribution of our study is to devise information-
theoretic axioms to quantify search costs in bounded optimization problems.
This allows for a unified treatment of control and game-theoretic problems, as
well as estimation and learning problems for both perfectly rational and bounded
rational agents. In the future it will be interesting to relate the thermodynamic
resource costs of bounded rational agents to more traditional notions of resource
costs in computer science like space and time requirements of algorithms [59].

Variational Preferences.
In the economic literature the Kullback-Leibler
divergence has appeared in the context of multiplier preference models that can
deal with model uncertainty [37]. Especially, it has been proposed that a bound
on the Kullback-Leibler divergence could be used to indicate how much of a
deviation from a proposed model p0 is allowed when computing robust decision

17


strategies that work under a range of models in the neighborhood of p0. In
variational preference models [60] this is generalized to models of the form

f ⪰ g ⇐⇒ min
p

��
u(f)dp + c(p)
�
≥ min
p

��
u(g)dp + c(p)
�
,

where c(p) can be interpreted as an ambiguity index that can explain effects of
ambiguity aversion. The thermodynamic certainty-equivalent of work—computed
as the log-partition sum—also falls within this preference model. However, an
important difference is that the choice in a thermodynamic system is not de-
terministic with respect to the certainty-equivalent, but stochastic following
a generalized Boltzmann distribution. Due to this stochasticity of the choice
behavior itself, the thermodynamic model can be linked to both bounded ratio-
nality and model uncertainty, whereas variational preference models have so far
concentrated on explaining effects of ambiguity aversion and model uncertainty.

Ellsberg’s and Allais’ paradox.
Two of the most famous deviations from
expected utility theory that have been consistently observed in human decision-
making are the paradoxa of Ellsberg [61] and Allais [62]. While the first paradox
has encouraged a large literature dealing with model uncertainty [37], the latter
paradox has led to the development of prospect theory [63, 64]. Ellsberg could
show that human choice in the face of ambiguity differs from decision-making
under risk where precise probability models are available. Humans typically
tend to avoid ambiguous options, rather than choosing the option with higher
expected utility. The observed ambiguity aversion can be modeled straightfor-
wardly by a bounded rational decision-maker by allowing some degree of mini-
maxing in the spirit of a risk-sensitive controller—see Supplementary Material
for details. Allais could show that humans frequently reverse their preferences
in choice tasks that may not lead to preference reversals according to expected
utility theory. These reversals typically occur for different levels of riskiness of
the same choices. The explanation of the Allais paradox within the framework of
bounded rationality is not as straightforward as the Ellsberg paradox, but may
involve context-dependent changes of the boundedness parameter or biases in
the decision-making process that lead to a generalized quasi- linear mean model
[65, 66, 67, 68], which provides an alternative account of preference reversals
of the Allais type without violating the principle of stochastic dominance—see
Supplementary Material for more details.

Stochastic Choice.
Stochastic choice rules have been extensively studied in
the psychological and econometric literature, in particular logit choice mod-
els based on the Boltzmann distribution [69, 70]. The literature on Boltzmann
distributions for decision-making goes back to Luce [71], extending through Mc-
Fadden [72], Meginnis [73], Fudenberg [74] and Wolpert [55, 75, 50]. Luce [71]
has studied stochastic choice rules of the form p(xi) ∼
wi
�

j wj , which includes
the Boltzmann distribution and the “softmax”-rule known in the reinforcement
learning literature [76]. McFadden [72] has shown that such distributions can

18


arise, for example, when utilities are contaminated with additive noise follow-
ing an extreme value distribution. While stochastic choice models are generally
accepted to account for human choices better than their deterministic coun-
terparts [77, 78, 79], they have also been strongly criticized, especially for a
property known as independence of irrelevant alternatives (IIA). Similar to the
independence axiom in expected utility theory, IIA implies that the ratio of
two choice probabilities does not depend on the presence of a third irrelevant
alternative in the choice set. What distinguishes the free energy equations from
above choice rules is that stochastic choice behavior is described by a generalized
exponential family distribution of the form p(x) ∼ p0(x) exp(βU(x)). Changing
the choice set might in general also change the prior p0(x), but more importantly
it might also change the resource parameter β.

Diffusion-to-bound models.
Diffusion-to-bound models typically model the
process of binary decision-making as a random walk process that terminates once
it hits one of two given decision bounds [80]. Each time step of the random walk
provides noisy evidence towards one of the two options. This implements a nat-
ural speed-accuracy trade-off: the further away the bounds the more reliable the
decision will be, as the noise can be averaged out, but also the longer one has to
wait. The resulting choice probabilities are identical to the choice probabilities
of a bounded rational decision-maker if we relate the decision bound of the ran-
dom walk with the boundedness parameter in 7—see Supplementary Material
for details. The boundedness parameter can then also be shown to be propor-
tional to the time required for the decision-making process. Decision-to-bound
models have been widely used in behavioral psychology and neuroscience to
explain probabilistic choice and reaction times in psychometric experiments—
see [81] for a review. Decision-makers that apply the decision-to-bound model
may be regarded as bounded rational decision-makers from a normative point
of view.

Free Energy Principle.
A central property of closed thermodynamic sys-
tems is that they minimize free energy. A free energy principle based on the
variational Bayes approach has recently also been proposed as a theoretical
framework to understand brain function [82, 83]. In this framework generative
models of the form p(y|h, a) explain how hidden causes h in the environment and
actions a produce observations y. The brain uses an approximative distribution
Q(h; a) to determine the hidden causes. The free energy

F = −
�
dh Q(h; a) ln P(y, h|a) −
�
dh Q(h; a) ln Q(h; a)

measures how well the brain is doing with this approximation. According to
[82, 83], action and perception consist in choosing a and Q respectively so as to
minimize this free energy. In light of the thermodynamic view of free energy,
maximizing the likelihood − ln P(y, h|a)—or minimizing surprise—is a partic-
ular choice of potential function φ, where the boundedness consists in being

19


restricted to model class Q instead of having full disposal of p(y|h, a). More
generally, variational Bayes methods that use particular classes of distributions
to approximate the posterior could thus be regarded as a form of bounded in-
ference within this picture.

8
Conclusion

Thermodynamics provides a framework for bounded rationality that can be
both descriptive and prescriptive. It is descriptive in the sense that it describes
behavior that is clearly sub-optimal from the point of view of a perfect rational
decision-maker with infinite resources. It is prescriptive in the sense that it
prescribes how a bounded rational actor should behave optimally given resource
constraints formalized by β. As we have argued in this paper, bounded rational
decision-making provides an overarching principle in both senses in economics,
engineering, artificial intelligence, psychology and neuroscience.
A thermodynamic model of bounded rational decision-making has also two
advantages over traditional decision theory of perfect rationality. First, it allows
connecting computational processes with real physical processes, for example
how much entropy they generate and how much energy they require [13]. Sec-
ond, it suggests a notion of intelligence that is closely related to the process
of evolution. It is straightforward to see that bounded rational controllers of
the form (9) share their structure with Bayes’ rule, where we identify the prior
p0(x), the likelihood model eβU(x) and the posterior p(x), normalized by the
partition function, thus, establishing a close link between inference and con-
trol [84]. Furthermore, both bounded rational controllers and Bayes’ rule share
their structural form with discrete replicator dynamics that model evolutionary
processes [85], where samples (a population) are pushed through a fitness func-
tion (likelihood, gain function) that biases the distribution of the population,
thereby transforming a distribution p0 to a new distribution p. In this picture
different hypotheses x compete for probability mass over subsequent iterations,
favoring those x that have a lower-than-average cost. Just like the evolution-
ary random processes of variation and selection created intelligent organisms on
a phylogenetic timescale, similar random processes might underlie (bounded)
intelligent behavior in individuals on an ontogenetic timescale.

9
Acknowledgments

This study was supported by the DFG, Emmy Noether grant BR4164/1-1.

References

[1] Gigerenzer G, Todd PM, ABC Research Group. Simple Heuristics That
Make Us Smart. New York: Oxford University Press; 1999.

20


[2] Gladwell M. Blink: the power of thinking without thinking. New York:
Little, Brown and Company; 2005.

[3] Niv Y, Daw ND, Joel D, Dayan P. Tonic dopamine: opportunity costs
and the control of response vigor.
Psychopharmacology (Berl). 2007
Apr;191(3):507–520.

[4] Daw ND. ‘Model-based reinforcement learning as cognitive search: neu-
rocomputational theories’. In: Cognitive search: evolution algorithms and
the brain. Boston: MIT Press; 2012. .

[5] Von Neumann J, Morgenstern O. Theory of Games and Economic Behavior.
Princeton: Princeton University Press; 1944.

[6] Savage LJ. The Foundations of Statistics. New York: John Wiley and
Sons; 1954.

[7] Fishburn P. The Foundations of Expected Utility. Dordrecht: D. Reidel
Publishing; 1982.

[8] Simon H. Theories of Bounded Rationality. In: Radner CB, Radner R, ed-
itors. Decision and Organization. Amsterdam: North Holland Publ.; 1972.
p. 161–176.

[9] Simon H. Models of Bounded Rationality. Cambridge, MA: MIT Press;
1984.

[10] Rubinstein A. Modeling Bounded Rationality. Cambridge, MA: MIT Press;
1999.

[11] Gigerenzer G, Selten R. In: Bounded rationality: the adaptive toolbox.
Cambridge, MA: MIT Press; 2001. .

[12] Feynman RP. The Feynman Lectures on Computation. Addison-Wesley;
1996.

[13] Tribus M, McIrvine EC. Energy and Information. Scientific American.
1971;225:179–188.

[14] Landauer R. Irreversibility and Heat Generation in the Computing Process.
IBM Journal of Research and Development. 1961;5(3):183–191.

[15] Bennett CH. Logical Reversibility of Computation. IBM Journal of Re-
search and Development. 1973;17(6):525–532.

[16] Bennett CH. The thermodynamics of computationa review. International
Journal of Theoretical Physics. 1982;21(12):905–940.

[17] MacKay DJC. Information Theory, Inference, and Learning Algorithms.
Cambridge University Press; 2003.

21


[18] Ortega PA, Braun DA.
A conversion between utility and information.
In: Proceedings of the third conference on artificial general intelligence.
Atlantis Press; 2010. p. 115–120.

[19] Ortega PA.
A Unified Framework for Resource-Bounded Autonomous
Agents Interacting with Unknown Environments.
Department of Engi-
neering, University of Cambridge, UK; 2011.

[20] Shannon CE. A mathematical theory of communication. Bell System Tech-
nical Journal. 1948 Jul and Oct;27:379–423 and 623–656.

[21] Csisz´ar I. Axiomatic Characterizations of Information Measures. Entropy.
2008;10:261–273.

[22] Callen HB. Thermodynamics and an introduction to thermostatistics. New
York: John Wiley & Sons; 1985.

[23] Simon HA. Rational choice and the structure of the environment. Psycho-
logical Review. 1956;63(2):129–38.

[24] Gumbel EJ. Statistics of Extremes. New York: Columbia University Press;
1958.

[25] Ortega PA, A BD. Free Energy and the Generalized Optimality Equa-
tions for Sequential Decision Making. arXiv:12053997v1 (presented at the
European Workshop for Reinforcement Learning). 2012;.

[26] Todorov E. Linearly solvable Markov decision problems. In: Advances in
Neural Information Processing Systems. vol. 19; 2006. p. 1369–1376.

[27] Todorov E. Efficient computation of optimal actions. Proceedings of the
National Academy of Sciences USA. 2009;106:11478–11483.

[28] Braun DA, Ortega PA. Path integral control and bounded rationality. In:
IEEE Symposium on adaptive dynamic programming and reinforcement
learning; 2011. p. 202–209.

[29] Ortega PA, Braun DA. Information, utility and bounded rationality. In:
Lecture notes on artificial intelligence. vol. 6830; 2011. p. 269–274.

[30] Kappen HJ. A linear theory for control of non-linear stochastic systems.
Physical Review Letters. 2005;95:200201.

[31] Theodorou E, Buchli J, Schaal S.
A generalized path integral ap-
proach to reinforcement learning. Journal of Machine Learning Research.
2010;11:3137–3181.

[32] Bellman RE. Dynamic Programming. Princeton, NJ: Princeton University
Press; 1957.

22


[33] Whittle P.
Risk-sensitive optimal control. New York: John Wiley and
Sons; 1990.

[34] van den Broek JL, Wiegerinck WAJJ, Kappen HJ.
Risk-sensitive path
integral control. In: UAI. vol. 6; 2010. p. 1–8.

[35] Markowitz H. Portfolio Selection. The Journal of Finance. 1952;7:77–91.

[36] Ba¸sar T, Bernhard P.
H-infinity optimal control and related minimax-
design problems: a dynamic game approach. Boston: Birkh¨auser; 1991.

[37] Hansen LP, Sargent TJ.
Robustness.
Princeton: Princeton University
Press; 2008.

[38] Jacobson D. Optimal stochastic linear systems with exponential perfor-
mance criteria and their relation to deterministic differential games. IEEE
T Automat Contr AC. 1973;18:124–131.

[39] Glover K, Boyle J. State-space formulae for all stabilizing controllers that
satisfy an H-norm bound and relations to risk sensitivity. Syst Control
Lett. 1988;11:167–172.

[40] Lipman B. Information Processing and Bounded Rationality: A Survey.
Canadian Journal of Economics. 1995;28(1):42–67.

[41] Russell SJ. Rationality and Intelligence. In: Mellish C, editor. Proceedings
of the Fourteenth International Joint Conference on Artificial Intelligence.
San Francisco: Morgan Kaufmann; 1995. p. 950–957.

[42] Russell SJ, Subramanian D. Provably bounded-optimal agents. Journal of
Artificial Intelligence Research. 1995;3:575–609.

[43] Aumann RJ. Rationality and Bounded Rationality. Games and Economic
Behavior. 1997 October;21(1-2):2–14.

[44] Kahneman D. Maps of Bounded Rationality: Psychology for Behavioral
Economics. American Economic Review. 2003 December;93(5):1449–1475.

[45] Spiegler R.
Bounded Rationality and Industrial Organization.
Oxford:
Oxford University Press; 2011.

[46] Camerer C. Behavioral Game Theory: Experiments in Strategic Interac-
tion. Princeton: Princeton University Press; 2003.

[47] D MR, R PT.
Quantal Response Equilibria for Normal Form Games.
Games and Economic Behavior. 1995 July;10(1):6–38.

[48] Mckelvey R, Palfrey TR. Quantal Response Equilibria for Extensive Form
Games. Experimental Economics. 1998;1:9–41.

[49] Anderson SP, Goeree JK, Holt CA. The logit equilibrium: a perspective on
intuitive behavioral anomalies. Southern Economic Journal. 2002;69:21–47.

23


[50] Wolpert DH, Harre M, Bertschinger N, Olbrich E, Jost J. Hysteresis ef-
fects of changing parameters of noncooperative games. Physical Review E.
2012;85:036102.

[51] Peters J, M¨ulling K, Altun Y. Relative entropy policy search. In: AAAI;
2010. .

[52] Kappen HJ, G´omez V, Opper M. Optimal control as a graphical model
inference problem. Machine Learning. 2012;1:1–11.

[53] Saridis G. Entropy formulation for optimal and adaptive control. IEEE
Transactions on Automatic Control. 1988;33:713–721.

[54] Still S. An information-theoretic approach to interactive learning. Euro-
physics Letters. 2009;85:28005.

[55] Wolpert DH. Information theory - the bridge connecting bounded rational
game theory and statistical physics. In: Braha D, Bar-Yam Y, editors.
Complex Engineering Systems. Perseus Books; 2004. .

[56] Stone LD. Theory of optimal search. New York: Academic Press; 1998.

[57] Jaynes ET. Entropy and search theory. In: Smith CR, Grandy WT, editors.
Maximum entropy and Bayesian methods in inverse problems. Dordrecht:
Reidel; 1985. .

[58] Tishby N, Polani D. Information Theory of Decisions and Actions. In: Vas-
silis T Hussain, editor. Perception-reason-action cycle: Models, algorithms
and systems. Berlin: Springer; 2011. .

[59] Vitanyi PMB. Time, space, and energy in reversible computing. In: Pro-
ceedings of the 2nd ACM conference on Computing frontiers; 2005. p. 435–
444.

[60] Maccheroni F, Marinacci M, Rustichini A.
Ambiguity aversion, robust-
ness, and the variational representation of preferences.
Econometrica.
2006;74:1447–1498.

[61] Ellsberg D. Risk, Ambiguity and the Savage Axioms. The Quaterly Journal
of Economics. 1961;75:643–669.

[62] Allais M. Le comportment de l’homme rationnel devant la risque: critique
des postulats et axiomes de l’ecole Americaine. Econometrica. 1953;21:503–
546.

[63] Kahneman D, Tversky A. Prospect Theory: An Analysis of Decision under
Risk. Econometrica. 1979;47:263–291.

[64] Tversky A, Kahneman D. Advances in prospect theory: Cumulative repre-
sentation of uncertainty”. Journal of Risk and Uncertainty. 1992;5:297–323.

24


[65] Nagumo M. ¨Uber eine Klasse der Mittelwerte. Japan Journal of Mathe-
matics. 1930;7:71–79.

[66] Kolmogorov A. Sur la notion de la moyenne. Rendiconti accademia dei
lincei. 1930;12:388–391.

[67] de Finetti B. Sul concetto di media. Giornale dell’ istituto italiano degli
attuari. 1931;2:369–396.

[68] Hong CS. A generalization of the quasilinear mean with application to the
measurement of income inequality and decision theory resolving the Allais
paradox. Econometrica. 1983;51:1065–1092.

[69] Luce RD. Utility of gains and losses: measurement-theoretical and experi-
mental approaches. Mahwah, NJ: Erlbaum; 2000.

[70] Train KE. Discrete Choice Methods with Simulation. 2nd ed. Cambridge:
Cambridge University Press; 2009.

[71] Luce RD. Individual choice behavior. Oxford: Wiley; 1959.

[72] McFadden D. Conditional logit analysis of qualitative choice behavior. In:
Zarembka P, editor. Frontiers in econometrics. New York: Academic Press;
1974. .

[73] Meginnis JR. A new class of symmetric utility rules for gambles, subjective
marginal probability functions, and a generalized Bayes rule. Proceedings
of the American Statistical Association, Business and Economic Statistics
Section. 1976;p. 471–476.

[74] Fudenberg D, Kreps D. Learning mixed equilibria. Games and Economic
Behavior. 1993;5:320–367.

[75] Lee R, Wolpert DH. Game-Theoretic Modeling of Human Behavior in Mid-
Air Collisions. In: T Guy MK, Wolpert DH, editors. Decision-Making with
Imperfect Decision Makers. Springer; 2011. .

[76] Sutton RS, Barto AG. Reinforcement Learning: An Introduction. Cam-
bridge, MA: MIT Press; 1998.

[77] Rieskamp J. The probabilistic nature of preferential choice. Journal of Ex-
perimental Psychology: Learning, Memory, and Cognition. 2008;34:1446–
1465.

[78] Glscher J, Daw N, Dayan P, O’Doherty JP. States versus rewards: dissocia-
ble neural prediction error signals underlying model-based and model-free
reinforcement learning. Neuron. 2010;66(4):585–95.

[79] McDannald MA, Takahashi YK, Lopatina N, Pietras BW, Jones JL,
Schoenbaum G.
Model-based learning and the contribution of the or-
bitofrontal cortex to the model-free world. Eur J Neurosci. 2012;35(7):991–
6.

25


[80] Busemeyer JR, Diederich A. Survey of decision field theory. Mathematical
Social Sciences. 2002;43(3):345–370.

[81] Bogacz R, Brown E, Moehlis J, Holmes P, Cohen JD.
The physics of
optimal decision making: a formal analysis of models of performance in
two-alternative forced-choice tasks. Psychological Review. 2006;113:700–
765.

[82] Friston K. The free-energy principle: a rough guide to the brain? Trends
in Cognitive Science. 2009;13:293–301.

[83] Friston K.
The free-energy principle: a unified brain theory?
Nature
Review Neuroscience. 2010;11:127–138.

[84] Ortega PA, Braun DA. A minimum relative entropy principle for learning
and acting. Journal of Artificial Intelligence Research. 2010;38:475–511.

[85] Shahlizi CR.
Dynamics of bayesian updating with dependent data and
misspecified models. Electronic Journal of Statistics. 2009;3:1039–1074.

26