1689 lines
60 KiB
Plaintext
1689 lines
60 KiB
Plaintext
Thermodynamics as a theory of decision-making
|
||
with information processing costs
|
||
|
||
Pedro A. Ortega and Daniel A. Braun
|
||
|
||
July 31, 2012
|
||
|
||
Abstract
|
||
|
||
Perfectly rational decision-makers maximize expected utility, but cru-
|
||
cially ignore the resource costs incurred when determining optimal actions.
|
||
Here we propose an information-theoretic formalization of bounded ratio-
|
||
nal decision-making where decision-makers trade off expected utility and
|
||
information processing costs. Such bounded rational decision-makers can
|
||
be thought of as thermodynamic machines that undergo physical state
|
||
changes when they compute. Their behavior is governed by a free en-
|
||
ergy functional that trades off changes in internal energy—as a proxy for
|
||
utility—and entropic changes representing computational costs induced
|
||
by changing states. As a result, the bounded rational decision-making
|
||
problem can be rephrased in terms of well-known concepts from statis-
|
||
tical physics.
|
||
In the limit when computational costs are ignored, the
|
||
maximum expected utility principle is recovered. We discuss the relation
|
||
to satisficing decision-making procedures as well as links to existing theo-
|
||
retical frameworks and human decision-making experiments that describe
|
||
deviations from expected utility theory. Since most of the mathematical
|
||
machinery can be borrowed from statistical physics, the main contribution
|
||
is to axiomatically derive and interpret the thermodynamic free energy as
|
||
a model of bounded rational decision-making.
|
||
|
||
1
|
||
Introduction
|
||
|
||
In everyday life decision-makers often have to make fast and frugal choices
|
||
[1, 2]. Consider, for example, an antelope that quickly has to choose a direction
|
||
of flight when faced with a predator. By the time an antelope had considered
|
||
all possible flight paths to determine the optimal one, it would most probably
|
||
be already eaten. In general, decision-makers seem to trade off the expected
|
||
desirability of the consequences of an action against the effort and resources
|
||
(time, money, food, computational effort, knowledge, opportunity costs, etc.)
|
||
required for searching the optimum [3, 4].
|
||
Classic theories of decision making generally ignore information-processing
|
||
costs by assuming that decision makers always pick the option with maximum
|
||
|
||
1
|
||
|
||
|
||
return—irrespective of the effort or the resources it might take to find or com-
|
||
pute the optimal action [5, 6, 7]. Such decision-makers are described as perfectly
|
||
rational. However, being perfectly rational seems to contradict our intuition of
|
||
real-world decision-making, where information processing constraints play an
|
||
important role [1]. This has led to an abundant literature on bounded rational-
|
||
ity [8, 9, 10, 11]. Unlike perfectly rational decision makers, bounded rational
|
||
decision-makers are subject to information processing constraints, that is they
|
||
may have limited time and speed to process a limited amount of information.
|
||
|
||
1.1
|
||
Thermodynamic Intuition
|
||
|
||
a)
|
||
b)
|
||
|
||
pV
|
||
|
||
(1 − p)V
|
||
|
||
A
|
||
|
||
B
|
||
|
||
Figure 1: The Molecule-In-A-Box Device.
|
||
(a) Initially, the molecule moves
|
||
freely within a space of volume V delimited by two pistons. The compartments
|
||
A and B correspond to the two logical states of the device. (b) Then, the lower
|
||
piston pushes the molecule into part A having volume V ′ = pV .
|
||
|
||
Here we follow a thermodynamic argument [12] that allows measuring re-
|
||
source (or information) costs in physical systems in units of energy. The gen-
|
||
erality of the argument relies on the fact that ultimately any real agent has to
|
||
be incarnated in a physical system, as any process of information processing
|
||
must always be accompanied by a pertinent physical process [13]. In the fol-
|
||
lowing we conceive of information processing as changes in information states
|
||
(i.e. ultimately changes of probability distributions), which consequently im-
|
||
plies changes in physical states, such as flipping gates in a transistor, changing
|
||
voltage on a microchip, or even changing location of a gas particle. Such state
|
||
changes in physical systems are not for free, that is the do not happen sponta-
|
||
neously. Consequently, if we want to control a physical system into a desirable
|
||
state we also have to take into consideration that changing from the current
|
||
state to the desirable state incurs a cost.
|
||
According to Landauer’s principle, one can postulate a formal correspon-
|
||
dence between one unit of information and one unit of energy [14, 15, 16].
|
||
Consider representing one bit of information using one of the following logical
|
||
devices: a molecule that can be located either on the top or the bottom part of
|
||
|
||
2
|
||
|
||
|
||
a box; a coin whose face-up side can be either head or tail; a door that can be
|
||
either open or closed; a train that can be orientated facing either north or south;
|
||
and so forth. Assume that all these devices are initialized in an undetermined
|
||
logical state, where the first state has probability p and the second probability
|
||
1 − p. Now, imagine you want to set these devices to their first logical state.
|
||
In the case of the molecule in a box, this means the following. Initially, the
|
||
molecule is uniformly moving around within a space confined by two pistons as
|
||
depicted in Figure 1a. Assuming that the initial volume is V , the molecule has
|
||
to be pushed by the lower piston into the upper part of the box having volume
|
||
V ′ = pV (Figure 1b). From information theory, we know that the number of
|
||
bits that we fix by this operation is given by − log p.
|
||
To make things concrete, we assume that the device has diathermal walls
|
||
and is immersed in a heat bath at constant temperature T . Since the walls are
|
||
diathermal, the temperature inside of the box is maintained at the temperature
|
||
of the heat bath. We model the particle as an ideal gas. When an ideal gas
|
||
is compressed under isothermal conditions from an initial volume V to a final
|
||
volume V ′, then the work is calculated as
|
||
|
||
W = −
|
||
� V ′
|
||
|
||
V
|
||
|
||
NkT
|
||
|
||
V
|
||
dV = NkT ln V
|
||
|
||
V ′ ,
|
||
(1)
|
||
|
||
where N ≥ 0 is the amount of substance and k > 0 is the Boltzmann constant.
|
||
The minus sign is just a convention to denote work done by the piston rather
|
||
than by the gas. If we assume N = 1 and make use of the fact that V ′ = pV
|
||
we get
|
||
|
||
W = kT ln V
|
||
|
||
pV = −kT ln p = − kT
|
||
|
||
log e log p = −γmol log p,
|
||
|
||
where the constant γmol :=
|
||
RT
|
||
log e > 0 can be interpreted as the conversion factor
|
||
between one unit of information and one unit of energy for the molecule-in-a-box
|
||
device.
|
||
How do we compute the information and work for the case of the coin, door
|
||
and train devices? The important observation is that we can model these cases
|
||
as if they were like molecule-in-a-box devices, with the difference that their
|
||
conversion factors between units of information and units of work are different.
|
||
Hence, the number of bits fixed while these devices are set to the first state is
|
||
given by − log p, i.e. exactly as in the case of the molecule. However, the work
|
||
is given by
|
||
−γcoin log p,
|
||
−γdoor log p,
|
||
and
|
||
− γtrain log p
|
||
|
||
respectively, where γcoin, γdoor and γtrain are the associated conversion factors
|
||
between units of information. Obviously, γmol ≤ γcoin ≤ γdoor ≤ γtrain. The
|
||
point is that changes in knowledge states are costly and that these costs are
|
||
proportional to the information. In the next section, we derive a general ex-
|
||
pression of information costs in physical systems that make decisions.
|
||
|
||
3
|
||
|
||
|
||
2
|
||
Information-Theoretic Foundations
|
||
|
||
2.1
|
||
Resource Costs
|
||
|
||
We model any observable sequential process, such as a sequence of interactions
|
||
or a sequence of computation steps, as a filtration on a measure space.
|
||
To
|
||
simplify our exposition, we consider only finite measure spaces.
|
||
Let (Ω, Σ)
|
||
denote a measurable space, where Ω denotes the sample space and where Σ is
|
||
a σ-algebra on Ω. Let p be a conditional probability measure on (Ω, Σ), such
|
||
that for any two events A, B ∈ Σ, p(A|B) denotes the conditional probability of
|
||
the A given B, where the condition B plays the role of the current information
|
||
state of the process. The sequential realization of a process is modelled as a
|
||
sequence of conditions A1, A2, . . . , AT on the sample space Ω, where each new
|
||
condition At refines the current information state �
|
||
τ≤t Aτ by excluding the
|
||
complement A∁
|
||
t .
|
||
We further assume that a transformation of an information state from B to
|
||
(A ∩ B) entails a cost ρ(A|B) that could be measured in dollars, time or any
|
||
arbitrary scale of effort. Moreover, we assume that this transformation cost is
|
||
decomposable; that is, if we undergo a knowledge change from C to (A∩B ∩C),
|
||
then we should pay the same cost as undergoing a change first from C to (B∩C)
|
||
and then from (B ∩ C) to (A ∩ B ∩ C). Finally, the quintessential information-
|
||
theoretic postulate is that conditional probabilities impose a monotonic order
|
||
over transformation costs1. We can sum up our postulates as follows:
|
||
|
||
Definition 1 (Axioms of Transformation Costs). Let (Ω, Σ) be a measurable
|
||
space and let p : (Σ × Σ) → [0, 1] be a conditional probability measure over
|
||
Σ (i.e. for any A ∈ Σ, p(·|A) is a probability measure over A). A function
|
||
ρ : (Σ × Σ) → R+ is a transformation cost function for p iff it has the following
|
||
three properties for all events A, B, C, D ∈ Σ:
|
||
|
||
A1. real-valued:
|
||
∃f,
|
||
ρ(A|B) = f
|
||
�
|
||
p(A|B)
|
||
�
|
||
∈ R,
|
||
A2. additive:
|
||
ρ(A ∩ B|C) = ρ(A|C) + ρ(B|A ∩ C),
|
||
|
||
A3. monotonic:
|
||
[ρ(A|B) > ρ(C|D)]
|
||
⇔
|
||
[p(A|B) ≶ p(C|D)].
|
||
|
||
These three properties enforce a strict correspondence between probabilities
|
||
and transformation costs [18, 19].
|
||
|
||
Theorem 1 (Transformation Costs ↔ Probabilities). If f is such that ρ(A|B) =
|
||
f(p(A|B)) for every choice of the probability space (Ω, Σ, p), then f is of the form
|
||
|
||
f(·) = − 1
|
||
|
||
β log(·),
|
||
|
||
where β is a real parameter.
|
||
|
||
1This intuition is central for optimal coding theory where short codewords are assigned
|
||
to frequent events and long codewords are assigned to rare events [17]. Therefore, we could
|
||
regard the codeword length as a valuable resource that we have to bet on events with different
|
||
probabilities.
|
||
|
||
4
|
||
|
||
|
||
That is, the transformation cost ρ(A|B) is proportional to the information
|
||
content − log p(A|B), where the parameter β plays the role of the conversion
|
||
factor.
|
||
The logarithmic mapping between probabilities and “costs” is well-
|
||
known in information theory, and there are many possible ways to derive it
|
||
[20, 21]. The important observation is that our derivation stems purely from
|
||
postulates regarding transformation costs.
|
||
According to Definition 1, transformation costs measure the relative cost of
|
||
an event relative to a reference event. However, we can also introduce an absolute
|
||
cost measure to single events such that transformation costs are obtained as
|
||
differences.
|
||
|
||
Definition 2 (Potential). Let ρ be a transformation cost function. A set func-
|
||
tion φ : Σ → R is called a cost potential for ρ iff for all A, B ∈ Σ,
|
||
|
||
φ(Ω) := φ0
|
||
φ(A ∩ B) := φ(B) + ρ(A|B)
|
||
∀A, B ∈ Σ,
|
||
|
||
where φ0 is an arbitrary real value.
|
||
|
||
One can easily verify that this potential is well defined for all events, and
|
||
that ρ(A|B) = φ(A ∩ B) − φ(B). It captures the intuition that starting out
|
||
from the high-probability event B with potential φ(B) one has to pay the cost
|
||
ρ(A|B) to arrive at the low-probability event A ∩ B with potential φ(A ∩ B).
|
||
In the following, consider a reference set S ∈ Σ having a measurable parti-
|
||
tion X. Cost potentials have an important recursive structure: the cost potential
|
||
of an event is uniquely determined by the potential of its constituent events.
|
||
If X is a measurable partition of a reference event S ∈ Σ, then
|
||
|
||
φ(S) = − 1
|
||
|
||
β log
|
||
�
|
||
|
||
x∈X
|
||
e−βφ(x).
|
||
(2)
|
||
|
||
Furthermore, the probability of a member x ∈ X of the partition relative to S
|
||
can be expressed as a Gibbs measure:
|
||
|
||
p(x|S) = e−βφ(x)
|
||
|
||
e−βφ(S) =
|
||
e−βφ(x)
|
||
|
||
�
|
||
x∈X e−βφ(x) .
|
||
(3)
|
||
|
||
In statistical physics it is well-known that the Gibbs measure satisfies a varia-
|
||
tional principle in the free energy, which is defined as
|
||
|
||
Fβ[q] :=
|
||
�
|
||
|
||
x∈X
|
||
q(x)φ(x) + 1
|
||
|
||
β
|
||
|
||
�
|
||
|
||
x∈X
|
||
q(x) log q(x).
|
||
(4)
|
||
|
||
More specifically, it is well known that for any probability measure q over the
|
||
partition X of S,
|
||
|
||
F[q] ≥ F[p] = − 1
|
||
|
||
β log φ(S),
|
||
(5)
|
||
|
||
where the lower bound is attained by the Gibbs measure p(x) ∝ e−βφ(x). Equa-
|
||
tions (2) to (5) constitute fundamental results that will be generalized and
|
||
interpreted in the next section.
|
||
|
||
5
|
||
|
||
|
||
2.2
|
||
Gains and Losses
|
||
|
||
Equipped with the results from the preceding section, we can now proceed to
|
||
model a bounded rational decision maker. Because transformation costs matter,
|
||
we model a decision as a transformation of a prior behavior into a final behavior,
|
||
where we represent the direction of change as a utility criterion.
|
||
The Gibbs measure in (3) allows us describing a probability measure p over
|
||
a partition X in terms of a cost potential φ over X. In particular, we see that a
|
||
decision-maker’s a priori behavior or belief described by p0(x) and φ0(x) changes
|
||
to p(x) and φ(x) if he is exposed to the gain (or loss) U(x), such that
|
||
|
||
φ(x) = φ0(x) − U(x)
|
||
(6)
|
||
|
||
and
|
||
p(x) ∝ e−βφ0(x)+βU(x) ∝ p0(x)eβU(x)
|
||
(7)
|
||
|
||
as illustrated in Figure 1. The function U represents either gains or losses and
|
||
not absolute levels of costs, because it expresses a difference in the potential
|
||
U(x) = φ0(x)−φ(x). The equilibrium distribution (7) that arises in a change can
|
||
also be characterized in terms of a variational principle, in a manner analogous
|
||
to (5).
|
||
|
||
Theorem 2 (Negative Free Energy Difference). Let p0(x) and p(x) be the Gibbs
|
||
measures with potentials φ0(x) and φ(x) and resource parameter β. Let F0 and
|
||
F be the free energies minimized by p0 and p respectively. Then, the negative
|
||
free energy difference −∆F = F0 − F is
|
||
|
||
−∆F =
|
||
�
|
||
|
||
x∈X
|
||
p(x)U(x) − 1
|
||
|
||
β
|
||
|
||
�
|
||
|
||
x∈X
|
||
p(x) log p(x)
|
||
|
||
p0(x),
|
||
(8)
|
||
|
||
where U(x) = φ0(x) − φ(x).
|
||
|
||
Since the difference in the negative free energy −∆F = F − F0 has the same
|
||
dependency on p as the free energy F, we can use −∆F directly as a variational
|
||
principle in p.
|
||
|
||
Corollary 3 (Variational Principle). The negative free energy difference pro-
|
||
vides a variational principle for the equilibrium distribution, i.e.
|
||
|
||
−∆F[q] :=
|
||
�
|
||
|
||
x∈X
|
||
q(x)U(x) − 1
|
||
|
||
β
|
||
|
||
�
|
||
|
||
x∈X
|
||
q(x) log q(x)
|
||
|
||
p0(x)
|
||
|
||
is maximized by
|
||
|
||
p(x) = 1
|
||
|
||
Z p0(x)eβU(x),
|
||
where
|
||
Z :=
|
||
�
|
||
|
||
x∈X
|
||
eβU(x).
|
||
|
||
Furthermore,
|
||
∆F[q] ≤ ∆F[p] = 1
|
||
|
||
β log Z.
|
||
|
||
6
|
||
|
||
|
||
φ0
|
||
|
||
−U
|
||
|
||
φ = φ0 − U
|
||
|
||
Initial
|
||
Final
|
||
|
||
low
|
||
|
||
high
|
||
|
||
Probability
|
||
|
||
Figure 2: Representing a decision maker as a thermodynamic system, the be-
|
||
havior of the decision-maker exposed to a gain U can be expressed as a change
|
||
of his initial cost potential φ0 to a final cost potential φ, where φ = φ0 − U.
|
||
The choice or belief probabilities of the decision-maker change according to (7)
|
||
from p0 to p.
|
||
|
||
2.3
|
||
Choice & Belief Probabilities
|
||
|
||
The distribution (7) can be interpreted both as an action or observation prob-
|
||
ability in the context of bounded rational decision-making. In the case of ac-
|
||
tions, p0 represents the a priori choice probability of the agent which is refined
|
||
to the choice probability p when evaluating the imposed gain (or loss) U. The
|
||
associated change in probability depends on the resource parameter β and cor-
|
||
responds to the computation that is necessary to evaluate the gains (or losses).
|
||
In the case of observations, p0 represents the a priori belief of the agent given
|
||
by a probabilistic model, which is then distorted due to the presence of possible
|
||
gains (or losses) that are evaluated by the holder of the belief. This way, model
|
||
uncertainty and risk-aversion can be parameterized by β.
|
||
For different values of β the distribution (7) has the following limits
|
||
|
||
lim
|
||
β→∞ p(x)
|
||
=
|
||
δ(x − x∗),
|
||
x∗ = max
|
||
x
|
||
U(x)
|
||
|
||
lim
|
||
β→0 p(x)
|
||
=
|
||
p0(x)
|
||
|
||
lim
|
||
β→−∞ p(x)
|
||
=
|
||
δ(x − x∗),
|
||
x∗ = min
|
||
x U(x).
|
||
|
||
In the case of actions the three limits imply the following: The limit β → ∞
|
||
corresponds to the perfectly rational actor that infallibly selects the action that
|
||
maximizes gain (or minimizes loss −U(x). The limit β → 0 is an actor without
|
||
resources that simply selects his action according to his prior. The limit β →
|
||
−∞ corresponds to an actor that is perfectly “anti-rational” and always selects
|
||
the action with the worst outcome. In the case of observations the three limits
|
||
correspond to an extremely optimistic observer (β → ∞) who believes only in
|
||
the best possible outcome, an extremely pessimistic observer (β → −∞) who
|
||
anticipates only the worst, and a risk-neutral Bayesian observer (β → 0) who
|
||
simply relies on the probabilistic model p0.
|
||
|
||
7
|
||
|
||
|
||
2.4
|
||
The Certainty Equivalent
|
||
|
||
In statistical physics [22], the free energy difference
|
||
|
||
∆A = ∆E − Q = W
|
||
|
||
measures the amount of available “good energy” (work W) by subtracting the
|
||
“bad energy” (heat Q) from the total energy ∆E = E[U]. The crucial physical
|
||
intuition is that we have uncertainty about some aspects of the objects that
|
||
make up the heat energy, for example we do not know the exact trajectories
|
||
of all gas particles at temperature β. This uncertainty means that we do not
|
||
have full control over the objects and cannot extract all the energy as work
|
||
[12]. Economically speaking, the physical concept of work, and therefore also
|
||
the difference in free energy, measures the certainty equivalent of a gain (or
|
||
loss) that is contaminated by uncertainty. In general, we can therefore use the
|
||
free energy difference to ascribe a certainty equivalent value to choice situations
|
||
of the form (7). As can be seen from Corollary 3, this value is given by the
|
||
log partition function, i.e. the logarithm of the normalization constant Z. For
|
||
different values of β, the certainty equivalent takes the following limits
|
||
|
||
lim
|
||
β→∞
|
||
1
|
||
β log Z
|
||
=
|
||
max
|
||
x
|
||
U(x)
|
||
|
||
lim
|
||
β→0
|
||
1
|
||
β log Z
|
||
=
|
||
�
|
||
|
||
x
|
||
p0(x)U(x)
|
||
|
||
lim
|
||
β→−∞
|
||
1
|
||
β log Z
|
||
=
|
||
min
|
||
x U(x).
|
||
|
||
Again, the case β → ∞ corresponds to the perfectly rational actor (or the
|
||
extremely optimistic observer), the case β → −∞ corresponds to the perfectly
|
||
“anti-rational” actor (or the extremely pessimistic observer) and the case β → 0
|
||
corresponds to the actor that has no resources (or the risk-neutral observer) such
|
||
that the best one can expect is the expected gain or loss.
|
||
Corollary 3 has two interpretations in statistical physics, either as an in-
|
||
stantiation of a minimum energy principle or as a maximum entropy principle
|
||
[22]. Accordingly, (7) can either be seen as the distribution that maximizes the
|
||
entropy given a constraint on the expectation value of U or as the distribution
|
||
that minimizes the expectation of −U given a constraint on the entropy of p.
|
||
In the context of observer modeling, the first interpretation provides a principle
|
||
for estimation and the second interpretation provides a principle for bounded
|
||
rational decision-making in the case of acting, which is a maximum expected
|
||
gain principle with a relative entropy constraint that bounds the information-
|
||
processing capacity of the decision-maker. In the relative entropy we recognize
|
||
the term 1
|
||
|
||
β log p(x) as our transformation costs ρ from Theorem 1 such that we
|
||
can express the negative free energy difference −∆F as
|
||
|
||
−∆F = E[U] − E[R],
|
||
|
||
where U(x) = φ0(x)− φ(x) represents gains (or losses) and R(x) = ρ(x)− ρ0(x)
|
||
represents the extra resource costs required to achieve the gain (or loss) U.
|
||
|
||
8
|
||
|
||
|
||
We can therefore see how the variational principle of Corollary 3 formalizes a
|
||
trade-off between expected gains (or losses) and information processing costs.
|
||
|
||
3
|
||
Summary of Main Concepts
|
||
|
||
In decision theory, choices between alternatives are usually formalized as choices
|
||
between lotteries, where a lottery is formalized as a set X of possible out-
|
||
comes, a probability distribution p0 over X, and a real-valued function U over
|
||
X called the utility function. In particular expected utility theory predicts that
|
||
a decision-maker always chooses the lottery with the higher expected utility
|
||
E[U] = �
|
||
x p0(x)U(x). Here we introduce the notion of a bounded lottery as a
|
||
lottery that is additionally characterized by a resource parameter β ∈ R that
|
||
captures the resource constraints of the decision-maker.
|
||
We have derived a thermodynamic framework for bounded lotteries from
|
||
simple axioms that measure information processing cost—see also [19].
|
||
The
|
||
most important difference of bounded decision-making compared to perfectly
|
||
rational decision-making is that the bounded decision-maker will not be able
|
||
to choose infallibly the best lottery. In fact, the resource constraints lead to
|
||
stochastic choice behavior which can be characterized by a probability distribu-
|
||
tion. The decision process then transforms an initial choice probability p0 into
|
||
a final choice probability p by taking into account the utility gains (or losses)
|
||
and the transformation costs. This transformation process can be formalized as
|
||
|
||
p(x) = 1
|
||
|
||
Z p0(x)eβU(x),
|
||
where
|
||
Z =
|
||
�
|
||
|
||
x′
|
||
p0(x′)eβU(x′).
|
||
(9)
|
||
|
||
Accordingly, the choice pattern of the decision-maker is predicted by the prob-
|
||
ability p. Crucially, the probability p extremizes the variational principle
|
||
|
||
max
|
||
p
|
||
|
||
� �
|
||
|
||
x
|
||
p(x)U(x) − 1
|
||
|
||
β
|
||
|
||
�
|
||
|
||
x
|
||
p(x) log p(x)
|
||
|
||
p0(x)
|
||
|
||
�
|
||
.
|
||
(10)
|
||
|
||
These two terms can be interpreted as determinants of bounded rational decision-
|
||
making in that they formalize a trade-off between an expected utility gain (first
|
||
term) and the information processing cost of transforming p0 into p (second
|
||
term). The certainty equivalent value of a bounded lottery can be obtained by
|
||
inserting the choice probability p from (9) into (10), yielding
|
||
|
||
V = 1
|
||
|
||
β log
|
||
��
|
||
|
||
x
|
||
p0(x)eβU(x)
|
||
�
|
||
,
|
||
(11)
|
||
|
||
which corresponds to the log partition sum. For different values of β, the cer-
|
||
|
||
9
|
||
|
||
|
||
a)
|
||
|
||
b)
|
||
|
||
Umax
|
||
|
||
Umin
|
||
|
||
E[U]
|
||
β
|
||
|
||
β1
|
||
|
||
β1
|
||
|
||
β2
|
||
|
||
β2
|
||
|
||
β3
|
||
|
||
β3
|
||
|
||
Figure 3: a) Negative free energy difference ∆F versus the resource parame-
|
||
ter β. The resource parameter allows modeling decision-makers with bounded
|
||
resources, either when generating their own actions (β > 0) or when anticipat-
|
||
ing their environment (β < 0). The negative free energy difference corresponds
|
||
to the certainty equivalent. b) Distribution over the outcomes depending on
|
||
the resource parameter β. For large positive β the distribution concentrates on
|
||
the outcome with maximum gain Umax. For large negative β the distribution
|
||
concentrates on the worst outcome with gain Umin. For β = 0 the outcomes
|
||
follow the given distribution p0.
|
||
|
||
tainty equivalent takes the following limits
|
||
|
||
lim
|
||
β→∞
|
||
1
|
||
β log Z
|
||
=
|
||
max
|
||
x
|
||
U(x)
|
||
|
||
lim
|
||
β→0
|
||
1
|
||
β log Z
|
||
=
|
||
�
|
||
|
||
x
|
||
p0(x)U(x)
|
||
|
||
lim
|
||
β→−∞
|
||
1
|
||
β log Z
|
||
=
|
||
min
|
||
x U(x).
|
||
|
||
The case β → ∞ corresponds to the perfectly rational actor (or the extremely
|
||
optimistic observer), the case β → −∞ corresponds to the perfectly “anti-
|
||
rational” actor (or the extremely pessimistic observer) and the case β → 0
|
||
corresponds to the actor that has no resources (or the risk-neutral observer)
|
||
such that the best one can expect is the expected gain or loss. For illustration
|
||
see Figure 2.
|
||
|
||
10
|
||
|
||
|
||
4
|
||
Bounded Rationality and Satisficing
|
||
|
||
Herbert Simon [23] proposed in the 50s that bounded rational decision-makers
|
||
do not commit to an unlimited optimization by searching for the absolute best
|
||
option. Rather, they follow a strategy of satisficing, i.e. they settle for an option
|
||
that is good enough in some sense. Since then, it has been debated whether sat-
|
||
isficing decision-makers can be described as bounded rational decision-makers
|
||
that act optimally under resource constraints or whether optimization is the
|
||
wrong concept altogether [11]. If decision-makers did indeed explicitly attempt
|
||
to solve such a constrained optimization problem, this would lead to an infinite
|
||
regress and the paradoxical situation that a bounded rational decision-maker
|
||
would have to solve a more complex (i.e. constrained) optimization problem
|
||
than a perfectly rational decision-maker.
|
||
To resolve this paradox, the bounded rational decision maker must not be
|
||
able to reason about his constraints. He just searches randomly for the best
|
||
option, until his resources run out. An observer will then be able to assign a
|
||
probability distribution to the decision-maker’s choices and investigate how this
|
||
probability distribution changes depending on the available resources. Consider,
|
||
for example, an anytime algorithm that will compute a solution more and more
|
||
precisely the more time it has at its disposal. As one does not want to wait
|
||
forever for an answer, the anytime computation will be interrupted at some
|
||
point where one assumes that the answer is going to be good enough. This
|
||
concept of satisficing can be used to interpret Equation 7 which describes the
|
||
choice rule of a bounded rational decision-maker.
|
||
Consider the problem of picking the largest number in a sequence U0, U1, U2, . . .
|
||
of i.i.d. data, where each Ui ∈ U is drawn from a source with probability dis-
|
||
tribution µ. This could be, for instance, an urn with numbered balls that we
|
||
draw with replacement and we always keep track of the largest number seen so
|
||
far. After m draws the largest number will be given by
|
||
|
||
v := max{U1, U2, . . . , Um}.
|
||
|
||
Naturally, the larger the number of draws, the higher the chances of observing a
|
||
large number. The cumulative distribution function of choosing v after m draws
|
||
is given by
|
||
Fm(v) = F0(v)m,
|
||
(12)
|
||
|
||
where F0 is the cumulative distribution function of µ [24]. If we only cared about
|
||
finding the maximum with absolute certainty then we would need to draw an
|
||
infinite amount of times. However, a bounded rational decision-maker would
|
||
stop after a certain time, when he feels that the benefit of further exploration
|
||
does not justify the effort of further drawings. Thus, the number of draws in
|
||
this example can be regarded as a resource and the numbers on the balls can
|
||
be regarded as utilities. The behavior of the bounded rational decision-maker
|
||
is then stochastic even though he acts perfectly deterministically, in the sense
|
||
that he chooses option v with probability (12) given the resource constraint
|
||
m. According to (12), the more resources a decision-maker spends, the more he
|
||
|
||
11
|
||
|
||
|
||
a)
|
||
b)
|
||
|
||
M = 0
|
||
M = 8
|
||
|
||
M = 32
|
||
M = 128
|
||
|
||
Umax
|
||
|
||
0
|
||
M
|
||
|
||
E[v]
|
||
|
||
E[v] − M · c
|
||
|
||
Figure 4: a) Distributions over the maximum for various sample sizes (M +
|
||
1). The distribution µ over the ten values v in U = {1, 2, 3, . . ., 10} follows a
|
||
truncated Poisson distribution with parameter λ = 5, as can be seen in the
|
||
plot for M = 0. The distribution approaches a delta function over v = 10 for
|
||
increasing values of M. b) The expected maximum v versus sample size (M +1).
|
||
The incremental gain of the expected maximum is marginally decreasing as the
|
||
sample size increases (red). If the sampling process is associated with a cost—
|
||
e.g. c = 0.02 per sample in the figure—, then the penalized expected maximum
|
||
(black) reaches a unique maximum for a finite sample size—the optimal sample
|
||
size is M = 35 in the figure.
|
||
|
||
resembles a perfectly rational decision-maker that chooses the maximum number
|
||
(Figure 1a), since the expected utility increases monotonically with the amount
|
||
of resources spent (Figure 1b). Importantly, however, note that the marginal
|
||
increase in the expected utility diminishes with larger effort—hence larger and
|
||
larger effort pays out less and less in the end. Below we formalize this trade-off.
|
||
Here we show that the boundedness parameter β plays an analogous role to
|
||
the number of draws m. In the limit of a continuous cumulative function F0,
|
||
the density after m draws is given by pm(v) =
|
||
d
|
||
dvF0(v)m. We can now compute
|
||
the log odds for two random outcomes v and v′, which results in
|
||
|
||
log pm(v)
|
||
|
||
pm(v′) = (m − 1) log F0(v)
|
||
|
||
F0(v′) + log µ(v)
|
||
|
||
µ(v′),
|
||
|
||
where F0(v) is again the cumulative of µ. If we require the probabilities pm(v) to
|
||
be representable by a distribution of the exponential family such that pm(v) =
|
||
µ(v) exp(αU(v))
|
||
�
|
||
dv′µ(v′) exp(αU(v′)), we see that the log odds have the following relation
|
||
|
||
log pm(v)
|
||
|
||
pm(v′) = α (U(v) − U(v′)) + log µ(v)
|
||
|
||
µ(v′).
|
||
|
||
We see that α and m play the role of the number of samples or computations.
|
||
In general, the following theorem can be shown to hold.
|
||
|
||
12
|
||
|
||
|
||
Theorem 4. Let X be a finite set. Let Q and M be strictly positive probability
|
||
distributions over X. Let α be a positive integer. Define Mα as the probability
|
||
distribution over the maximum of α samples from M. Then, there are strictly
|
||
positive constants δ and ξ depending only on M such that for all α,
|
||
����
|
||
Q(x)eαU(x)
|
||
|
||
�
|
||
|
||
x′ Q(x′)eαU(x′) − Mα(x)
|
||
���� ≤ e−(α−ξ)δ.
|
||
|
||
Consequently, one can interpret the inverse temperature as a resource param-
|
||
eter that determines how many samples are drawn to estimate the maximum.
|
||
Note that the distribution M is arbitrary as long as it has the same support
|
||
as Q.
|
||
This interpretation can be extended to a negative α, by noting that
|
||
αU(x) = (−α)(−U(x)), i.e. instead of the maximum we take the minimum of
|
||
−α samples.
|
||
|
||
5
|
||
Sequential Decision-Making
|
||
|
||
In the case of sequential decision-making the assumption of uniform temper-
|
||
atures has to be relaxed—the proofs of the following theorems can be found
|
||
in [25]. In general, we can then dedicate different amounts of computational
|
||
resources to each node of a decision tree. However, this requires a translation
|
||
between a tree with a single temperature and to a tree with different tempera-
|
||
tures. This translation can be achieved using the following theorem
|
||
|
||
Theorem 5. Let P be the equilibrium distribution for a given inverse tem-
|
||
perature α, utility function U and reference distribution Q. If the temperature
|
||
changes to β while keeping P and Q fixed, then the utility function changes to
|
||
|
||
V (x) = U(x) −
|
||
�
|
||
1
|
||
α − 1
|
||
|
||
β
|
||
�
|
||
log P(x)
|
||
|
||
Q(x).
|
||
|
||
If we now define the reward as the change in utility of two subsequent nodes,
|
||
then the rewards of the resulting decision tree are given by
|
||
|
||
R(xt|x<t)
|
||
:=
|
||
�
|
||
V (x≤t) − V (x<t)
|
||
�
|
||
|
||
=
|
||
�
|
||
U(x≤t) − U(x<t)
|
||
�
|
||
−
|
||
�
|
||
1
|
||
α −
|
||
1
|
||
|
||
β(x<t)
|
||
�
|
||
log P(xt|x<t)
|
||
|
||
Q(xt|x<t).
|
||
|
||
This allows introducing a collection of node-specific (not necessarily time-specific)
|
||
inverse temperatures β(x<t), allowing for a greater degree of flexibility in the
|
||
representation of information costs. The next theorem states the connection
|
||
between the free energy and the general decision tree formulation.
|
||
|
||
Theorem 6. The free energy of the whole trajectory can be rewritten in terms
|
||
|
||
13
|
||
|
||
|
||
of rewards:
|
||
|
||
Fα[P] =
|
||
�
|
||
|
||
x≤T
|
||
P(x≤T )
|
||
�
|
||
U(x≤T ) − 1
|
||
|
||
α log P(x≤T )
|
||
|
||
Q(x≤T )
|
||
|
||
�
|
||
|
||
= U(ε) +
|
||
�
|
||
|
||
x≤T
|
||
P(x≤T )
|
||
|
||
T
|
||
�
|
||
|
||
t=1
|
||
|
||
�
|
||
R(xt|x<t) −
|
||
1
|
||
|
||
β(x<t) log P(xt|x<t)
|
||
|
||
Q(xt|x<t)
|
||
|
||
�
|
||
.
|
||
(13)
|
||
|
||
This translation allows applying the free energy principle to each node with
|
||
a different resource parameter β(x<t).
|
||
By writing out the sum in (13), one
|
||
realizes that this free energy has a nested structure where the latest time step
|
||
forms the innermost variational problem and all other variational problems of
|
||
the previous time steps can be solved recursively by working backwards in time.
|
||
This then leads to the following solution:
|
||
|
||
Theorem 7. The solution to the free energy in terms of rewards is given by
|
||
|
||
P(xt|x<t) =
|
||
1
|
||
|
||
Z(x<t)Q(xt|x<t) exp
|
||
�
|
||
β(x<t)
|
||
�
|
||
R(xt|x<t) +
|
||
1
|
||
|
||
β(x≤t) log Z(x≤t)
|
||
��
|
||
,
|
||
|
||
where Z(x≤T ) = 1 and where for all t < T
|
||
|
||
Z(x<t) =
|
||
�
|
||
|
||
xt
|
||
Q(xt|x<t) exp
|
||
�
|
||
β(x<t)
|
||
�
|
||
R(xt|x<t) +
|
||
1
|
||
|
||
β(x≤t) log Z(x≤t)
|
||
��
|
||
.
|
||
|
||
6
|
||
Limit Cases of Bounded Rational Control
|
||
|
||
As described in the previous section, the belief and action probabilities of an
|
||
agent in a sequential decision-making setup can be determined by recursion of
|
||
the log-partition function
|
||
|
||
V (x<t) =
|
||
1
|
||
|
||
β(x<t) log
|
||
��
|
||
|
||
xt
|
||
Q(xt|x<t) exp
|
||
�
|
||
β(x<t)
|
||
�
|
||
R(xt|x<t) + V (x≤t)
|
||
���
|
||
,
|
||
|
||
(14)
|
||
where we have introduced V (x≤t) =
|
||
1
|
||
|
||
β(x≤t) log Z(x≤t). If xt is an action variable
|
||
then Q(xt|x<t) reflects the prior policy and the agent’s rationality β(x<t) deter-
|
||
mines in how far the value R(xt|x<t)+V (x≤t) can be optimized by the agent. If
|
||
xt is an observation variable then Q(xt|x<t) reflects the agent’s prior belief and
|
||
the rationality of the environment β(x<t) indicates how much one should de-
|
||
viate from the prior belief considering the possible values R(xt|x<t) + V (x≤t).
|
||
Depending on β(x<t), different decision-making schemes can be recovered—
|
||
compare Figure 3.
|
||
|
||
1. KL control. When assuming a history-independent loss function r(xt),
|
||
Markov probabilities p0(xt|xt−1) and β(x<t) = β for all x<t, Equation (14)
|
||
simplifies to a recursion that is equivalent to z-iteration which has previ-
|
||
ously been suggested in [26, 27] to approximately solve MDPs by means
|
||
|
||
14
|
||
|
||
|
||
β1
|
||
|
||
β2
|
||
|
||
+∞
|
||
|
||
−∞
|
||
|
||
+∞
|
||
−∞
|
||
|
||
Anti-
|
||
Rational
|
||
Irrational
|
||
Rational
|
||
|
||
Risk-
|
||
Seeking
|
||
|
||
Risk-
|
||
Neutral
|
||
|
||
Risk-
|
||
Averse
|
||
|
||
Figure 5: Schematic illustration of how resource parameters can model a range of
|
||
decision-making schemes: (1)- risk-seeking, bounded rational; (2) risk-neutral,
|
||
perfectly rational; (3) risk-averse, perfectly rational; and (4) robust, perfectly
|
||
rational.
|
||
|
||
of linear algebra—see [28, 29] for details of this equivalence relation. In
|
||
[26, 27] the transition probabilities of the MDP are controlled directly
|
||
and the control costs are given by the Kullback-Leiber divergence of the
|
||
manipulated state transition probabilities with respect to a baseline distri-
|
||
bution that describes the passive dynamics. In our framework, this kind
|
||
of KL control corresponds to the special case where all random variables
|
||
are action variables and the agent has boundedness parameter β. The
|
||
stochasticity in this case, however, is not thought to arise from environ-
|
||
mental passive dynamics, but rather is a direct consequence of bounded
|
||
rational control in a (possibly) deterministic environment. The continuous
|
||
case of KL control relies on the formalism of path integrals [30, 31], but
|
||
essentially the same relation to bounded rationality can be established—
|
||
see [28] for details.
|
||
|
||
2. Optimal stochastic control. When assuming β(x<t) → ∞ for all action
|
||
variables and β(x<t) → 0 for all observation variables, we approach the
|
||
limit of the perfectly rational decision-maker in a stochastic environment.
|
||
In this limit, the log-partition function converges to the expected utility
|
||
and the decision-maker acts deterministically so as to maximize the ex-
|
||
pected utility. For action variables, recursion (14) becomes the well-known
|
||
Bellman Optimality Equation [32]—see [28, 29] for details.
|
||
|
||
3. Risk-sensitive control.
|
||
Risk-sensitive control [33] corresponds to a
|
||
decision-maker with β(x<t) → ∞ for all action variables and β(x<t) ̸= 0
|
||
for observation variables.
|
||
Risk-sensitivity in the context of continuous
|
||
KL control has been previously proposed in [34].
|
||
Mean-variance deci-
|
||
|
||
15
|
||
|
||
|
||
sion criteria used in finance can be equally derived [35]. Risk-sensitive
|
||
decision-makers do not simply maximize the expectation of the utility,
|
||
but also consider higher-order cumulants by optimizing a stress function
|
||
given by the log partition sum. A risk-averse decision-maker (β(x<t) < 0),
|
||
for example, discounts variability off the expected utility.
|
||
In contrast,
|
||
risk-seeking decision-makers (β(x<t) > 0) add value to the expected util-
|
||
ity in the face of variability. Risk-sensitivity biases the beliefs about the
|
||
environment optimistically (collaborative environment) or pessimistically
|
||
(adversarial environment). Alternatively, one could regard a collabora-
|
||
tive environment also as a bounded rational controller that can choose
|
||
its own observation—that is the environment behaves like an extension of
|
||
the agent with partial control. Importantly, the stress function is typically
|
||
assumed in risk-sensitive control schemes in the literature, whereas here
|
||
it falls out naturally—see [29] for more details.
|
||
|
||
4. Robust control.
|
||
When assuming β(x<t) → ∞ for all action vari-
|
||
ables and β(x<t) → −∞ for all observation variables, we approach the
|
||
limit of the robust decision-maker in an unknown environment. When
|
||
β(x<t) → −∞, the decision-maker makes a worst case assumption about
|
||
the environment, namely that it is strictly adversarial and perfectly ra-
|
||
tional. This leads to the well-known game-theoretic minimax problem.
|
||
Minimax problems have been used to reformulate robust control problems
|
||
that allow controllers to cope with model uncertainties [36, 37]. Robust
|
||
control problems have long been known to be related to risk-sensitive con-
|
||
trol [38, 39]. Here we derived both control types from the same variational
|
||
principle—see [29] for more details.
|
||
|
||
7
|
||
Discussion
|
||
|
||
In the proposed thermodynamic interpretation of bounded rationality, agents
|
||
with limited resources search for a maximum over a set by randomly drawing
|
||
elements from this set. This random search leads to a utility function that is
|
||
marginally decreasing when more search effort is allocated. When such agents
|
||
pay a search cost, the bounded rational optimum is to abort the search as
|
||
soon as the marginal returns are equal to the search cost. The resulting trade-
|
||
off between utility maximization and resource costs can be quantified by the
|
||
Kullback-Leibler divergence with respect to an initial policy or belief. This ini-
|
||
tial probability distribution corresponds to the initial state of a thermodynamic
|
||
system that changes when a new potential is imposed. The difference in the po-
|
||
tential corresponds to utility gains or losses in economic choice. The difference
|
||
in the free energy corresponds to physical work and the economic certainty-
|
||
equivalent. Thus, gains or losses that are associated with uncertainty are effec-
|
||
tively devalued or overvalued, depending on the sign of the resource parameter.
|
||
This way risk-sensitivity, robustness to model uncertainty and game-theoretic
|
||
minimax-strategies can arise naturally.
|
||
|
||
16
|
||
|
||
|
||
Bounded rationality.
|
||
Starting with Simon [23], bounded rationality has
|
||
been extensively studied in psychology, economics, political science, industrial
|
||
organization, computer science and artificial intelligence research—see for ex-
|
||
ample [40, 41, 42, 43, 10, 11, 44, 45]. Additionally, numerous experiments in
|
||
behavioral economics have shown that humans systematically violate perfect
|
||
rationality, that is they are bounded rational [46]. Probably the most closely
|
||
related approach to bounded rationality with respect to the present article is
|
||
quantal response equilibrium (QRE) game theory [47, 48, 49, 50]. QRE models
|
||
assume bounded rational players whose choice probabilities are given by the
|
||
Boltzmann distribution and whose rationality is determined by a temperature
|
||
parameter. Interactions of such bounded rational players can lead to game-
|
||
theoretic solutions that deviate from the Nash equilibrium. The QRE model
|
||
is a special case of the model presented here where all prior probabilities are
|
||
assumed to be uniform. These prior probabilities are crucial when defining the
|
||
certainty-equivalent that ranges from minimum to maximum via the expected
|
||
utility. As the certainty-equivalent corresponds to physical work, this also al-
|
||
lows to relate bounded rational decision-making to thermodynamic processes.
|
||
The distinction of a prior policy and a utility that is optimized to some extent
|
||
is fundamental to the notion of bounded rationality proposed in this paper and
|
||
therefore also affords a qualitative advance of the bounded rationality model in
|
||
QRE models.
|
||
|
||
Information theory in control and game theory.
|
||
As already discussed,
|
||
a number of papers have suggested the use of the relative entropy as a cost
|
||
function for control [26, 27, 51, 52]. Previously, Saridis [53] has framed optimal
|
||
and adaptive control as entropy minimization problems. Statistical physics has
|
||
also served as an inspiration to a number of other studies, for example, to an
|
||
information-theoretic approach to interactive learning [54], to use information
|
||
theory to approximate joint strategies in games with bounded rational players
|
||
[55] and to the problem of optimal search [56, 57], where the utility losses
|
||
correspond directly to search effort.
|
||
Recently, Tishby and Polani [58] have
|
||
shown how to apply information theory to understand information flow in the
|
||
action-observation cycle. The contribution of our study is to devise information-
|
||
theoretic axioms to quantify search costs in bounded optimization problems.
|
||
This allows for a unified treatment of control and game-theoretic problems, as
|
||
well as estimation and learning problems for both perfectly rational and bounded
|
||
rational agents. In the future it will be interesting to relate the thermodynamic
|
||
resource costs of bounded rational agents to more traditional notions of resource
|
||
costs in computer science like space and time requirements of algorithms [59].
|
||
|
||
Variational Preferences.
|
||
In the economic literature the Kullback-Leibler
|
||
divergence has appeared in the context of multiplier preference models that can
|
||
deal with model uncertainty [37]. Especially, it has been proposed that a bound
|
||
on the Kullback-Leibler divergence could be used to indicate how much of a
|
||
deviation from a proposed model p0 is allowed when computing robust decision
|
||
|
||
17
|
||
|
||
|
||
strategies that work under a range of models in the neighborhood of p0. In
|
||
variational preference models [60] this is generalized to models of the form
|
||
|
||
f ⪰ g ⇐⇒ min
|
||
p
|
||
|
||
��
|
||
u(f)dp + c(p)
|
||
�
|
||
≥ min
|
||
p
|
||
|
||
��
|
||
u(g)dp + c(p)
|
||
�
|
||
,
|
||
|
||
where c(p) can be interpreted as an ambiguity index that can explain effects of
|
||
ambiguity aversion. The thermodynamic certainty-equivalent of work—computed
|
||
as the log-partition sum—also falls within this preference model. However, an
|
||
important difference is that the choice in a thermodynamic system is not de-
|
||
terministic with respect to the certainty-equivalent, but stochastic following
|
||
a generalized Boltzmann distribution. Due to this stochasticity of the choice
|
||
behavior itself, the thermodynamic model can be linked to both bounded ratio-
|
||
nality and model uncertainty, whereas variational preference models have so far
|
||
concentrated on explaining effects of ambiguity aversion and model uncertainty.
|
||
|
||
Ellsberg’s and Allais’ paradox.
|
||
Two of the most famous deviations from
|
||
expected utility theory that have been consistently observed in human decision-
|
||
making are the paradoxa of Ellsberg [61] and Allais [62]. While the first paradox
|
||
has encouraged a large literature dealing with model uncertainty [37], the latter
|
||
paradox has led to the development of prospect theory [63, 64]. Ellsberg could
|
||
show that human choice in the face of ambiguity differs from decision-making
|
||
under risk where precise probability models are available. Humans typically
|
||
tend to avoid ambiguous options, rather than choosing the option with higher
|
||
expected utility. The observed ambiguity aversion can be modeled straightfor-
|
||
wardly by a bounded rational decision-maker by allowing some degree of mini-
|
||
maxing in the spirit of a risk-sensitive controller—see Supplementary Material
|
||
for details. Allais could show that humans frequently reverse their preferences
|
||
in choice tasks that may not lead to preference reversals according to expected
|
||
utility theory. These reversals typically occur for different levels of riskiness of
|
||
the same choices. The explanation of the Allais paradox within the framework of
|
||
bounded rationality is not as straightforward as the Ellsberg paradox, but may
|
||
involve context-dependent changes of the boundedness parameter or biases in
|
||
the decision-making process that lead to a generalized quasi- linear mean model
|
||
[65, 66, 67, 68], which provides an alternative account of preference reversals
|
||
of the Allais type without violating the principle of stochastic dominance—see
|
||
Supplementary Material for more details.
|
||
|
||
Stochastic Choice.
|
||
Stochastic choice rules have been extensively studied in
|
||
the psychological and econometric literature, in particular logit choice mod-
|
||
els based on the Boltzmann distribution [69, 70]. The literature on Boltzmann
|
||
distributions for decision-making goes back to Luce [71], extending through Mc-
|
||
Fadden [72], Meginnis [73], Fudenberg [74] and Wolpert [55, 75, 50]. Luce [71]
|
||
has studied stochastic choice rules of the form p(xi) ∼
|
||
wi
|
||
�
|
||
|
||
j wj , which includes
|
||
the Boltzmann distribution and the “softmax”-rule known in the reinforcement
|
||
learning literature [76]. McFadden [72] has shown that such distributions can
|
||
|
||
18
|
||
|
||
|
||
arise, for example, when utilities are contaminated with additive noise follow-
|
||
ing an extreme value distribution. While stochastic choice models are generally
|
||
accepted to account for human choices better than their deterministic coun-
|
||
terparts [77, 78, 79], they have also been strongly criticized, especially for a
|
||
property known as independence of irrelevant alternatives (IIA). Similar to the
|
||
independence axiom in expected utility theory, IIA implies that the ratio of
|
||
two choice probabilities does not depend on the presence of a third irrelevant
|
||
alternative in the choice set. What distinguishes the free energy equations from
|
||
above choice rules is that stochastic choice behavior is described by a generalized
|
||
exponential family distribution of the form p(x) ∼ p0(x) exp(βU(x)). Changing
|
||
the choice set might in general also change the prior p0(x), but more importantly
|
||
it might also change the resource parameter β.
|
||
|
||
Diffusion-to-bound models.
|
||
Diffusion-to-bound models typically model the
|
||
process of binary decision-making as a random walk process that terminates once
|
||
it hits one of two given decision bounds [80]. Each time step of the random walk
|
||
provides noisy evidence towards one of the two options. This implements a nat-
|
||
ural speed-accuracy trade-off: the further away the bounds the more reliable the
|
||
decision will be, as the noise can be averaged out, but also the longer one has to
|
||
wait. The resulting choice probabilities are identical to the choice probabilities
|
||
of a bounded rational decision-maker if we relate the decision bound of the ran-
|
||
dom walk with the boundedness parameter in 7—see Supplementary Material
|
||
for details. The boundedness parameter can then also be shown to be propor-
|
||
tional to the time required for the decision-making process. Decision-to-bound
|
||
models have been widely used in behavioral psychology and neuroscience to
|
||
explain probabilistic choice and reaction times in psychometric experiments—
|
||
see [81] for a review. Decision-makers that apply the decision-to-bound model
|
||
may be regarded as bounded rational decision-makers from a normative point
|
||
of view.
|
||
|
||
Free Energy Principle.
|
||
A central property of closed thermodynamic sys-
|
||
tems is that they minimize free energy. A free energy principle based on the
|
||
variational Bayes approach has recently also been proposed as a theoretical
|
||
framework to understand brain function [82, 83]. In this framework generative
|
||
models of the form p(y|h, a) explain how hidden causes h in the environment and
|
||
actions a produce observations y. The brain uses an approximative distribution
|
||
Q(h; a) to determine the hidden causes. The free energy
|
||
|
||
F = −
|
||
�
|
||
dh Q(h; a) ln P(y, h|a) −
|
||
�
|
||
dh Q(h; a) ln Q(h; a)
|
||
|
||
measures how well the brain is doing with this approximation. According to
|
||
[82, 83], action and perception consist in choosing a and Q respectively so as to
|
||
minimize this free energy. In light of the thermodynamic view of free energy,
|
||
maximizing the likelihood − ln P(y, h|a)—or minimizing surprise—is a partic-
|
||
ular choice of potential function φ, where the boundedness consists in being
|
||
|
||
19
|
||
|
||
|
||
restricted to model class Q instead of having full disposal of p(y|h, a). More
|
||
generally, variational Bayes methods that use particular classes of distributions
|
||
to approximate the posterior could thus be regarded as a form of bounded in-
|
||
ference within this picture.
|
||
|
||
8
|
||
Conclusion
|
||
|
||
Thermodynamics provides a framework for bounded rationality that can be
|
||
both descriptive and prescriptive. It is descriptive in the sense that it describes
|
||
behavior that is clearly sub-optimal from the point of view of a perfect rational
|
||
decision-maker with infinite resources. It is prescriptive in the sense that it
|
||
prescribes how a bounded rational actor should behave optimally given resource
|
||
constraints formalized by β. As we have argued in this paper, bounded rational
|
||
decision-making provides an overarching principle in both senses in economics,
|
||
engineering, artificial intelligence, psychology and neuroscience.
|
||
A thermodynamic model of bounded rational decision-making has also two
|
||
advantages over traditional decision theory of perfect rationality. First, it allows
|
||
connecting computational processes with real physical processes, for example
|
||
how much entropy they generate and how much energy they require [13]. Sec-
|
||
ond, it suggests a notion of intelligence that is closely related to the process
|
||
of evolution. It is straightforward to see that bounded rational controllers of
|
||
the form (9) share their structure with Bayes’ rule, where we identify the prior
|
||
p0(x), the likelihood model eβU(x) and the posterior p(x), normalized by the
|
||
partition function, thus, establishing a close link between inference and con-
|
||
trol [84]. Furthermore, both bounded rational controllers and Bayes’ rule share
|
||
their structural form with discrete replicator dynamics that model evolutionary
|
||
processes [85], where samples (a population) are pushed through a fitness func-
|
||
tion (likelihood, gain function) that biases the distribution of the population,
|
||
thereby transforming a distribution p0 to a new distribution p. In this picture
|
||
different hypotheses x compete for probability mass over subsequent iterations,
|
||
favoring those x that have a lower-than-average cost. Just like the evolution-
|
||
ary random processes of variation and selection created intelligent organisms on
|
||
a phylogenetic timescale, similar random processes might underlie (bounded)
|
||
intelligent behavior in individuals on an ontogenetic timescale.
|
||
|
||
9
|
||
Acknowledgments
|
||
|
||
This study was supported by the DFG, Emmy Noether grant BR4164/1-1.
|
||
|
||
References
|
||
|
||
[1] Gigerenzer G, Todd PM, ABC Research Group. Simple Heuristics That
|
||
Make Us Smart. New York: Oxford University Press; 1999.
|
||
|
||
20
|
||
|
||
|
||
[2] Gladwell M. Blink: the power of thinking without thinking. New York:
|
||
Little, Brown and Company; 2005.
|
||
|
||
[3] Niv Y, Daw ND, Joel D, Dayan P. Tonic dopamine: opportunity costs
|
||
and the control of response vigor.
|
||
Psychopharmacology (Berl). 2007
|
||
Apr;191(3):507–520.
|
||
|
||
[4] Daw ND. ‘Model-based reinforcement learning as cognitive search: neu-
|
||
rocomputational theories’. In: Cognitive search: evolution algorithms and
|
||
the brain. Boston: MIT Press; 2012. .
|
||
|
||
[5] Von Neumann J, Morgenstern O. Theory of Games and Economic Behavior.
|
||
Princeton: Princeton University Press; 1944.
|
||
|
||
[6] Savage LJ. The Foundations of Statistics. New York: John Wiley and
|
||
Sons; 1954.
|
||
|
||
[7] Fishburn P. The Foundations of Expected Utility. Dordrecht: D. Reidel
|
||
Publishing; 1982.
|
||
|
||
[8] Simon H. Theories of Bounded Rationality. In: Radner CB, Radner R, ed-
|
||
itors. Decision and Organization. Amsterdam: North Holland Publ.; 1972.
|
||
p. 161–176.
|
||
|
||
[9] Simon H. Models of Bounded Rationality. Cambridge, MA: MIT Press;
|
||
1984.
|
||
|
||
[10] Rubinstein A. Modeling Bounded Rationality. Cambridge, MA: MIT Press;
|
||
1999.
|
||
|
||
[11] Gigerenzer G, Selten R. In: Bounded rationality: the adaptive toolbox.
|
||
Cambridge, MA: MIT Press; 2001. .
|
||
|
||
[12] Feynman RP. The Feynman Lectures on Computation. Addison-Wesley;
|
||
1996.
|
||
|
||
[13] Tribus M, McIrvine EC. Energy and Information. Scientific American.
|
||
1971;225:179–188.
|
||
|
||
[14] Landauer R. Irreversibility and Heat Generation in the Computing Process.
|
||
IBM Journal of Research and Development. 1961;5(3):183–191.
|
||
|
||
[15] Bennett CH. Logical Reversibility of Computation. IBM Journal of Re-
|
||
search and Development. 1973;17(6):525–532.
|
||
|
||
[16] Bennett CH. The thermodynamics of computationa review. International
|
||
Journal of Theoretical Physics. 1982;21(12):905–940.
|
||
|
||
[17] MacKay DJC. Information Theory, Inference, and Learning Algorithms.
|
||
Cambridge University Press; 2003.
|
||
|
||
21
|
||
|
||
|
||
[18] Ortega PA, Braun DA.
|
||
A conversion between utility and information.
|
||
In: Proceedings of the third conference on artificial general intelligence.
|
||
Atlantis Press; 2010. p. 115–120.
|
||
|
||
[19] Ortega PA.
|
||
A Unified Framework for Resource-Bounded Autonomous
|
||
Agents Interacting with Unknown Environments.
|
||
Department of Engi-
|
||
neering, University of Cambridge, UK; 2011.
|
||
|
||
[20] Shannon CE. A mathematical theory of communication. Bell System Tech-
|
||
nical Journal. 1948 Jul and Oct;27:379–423 and 623–656.
|
||
|
||
[21] Csisz´ar I. Axiomatic Characterizations of Information Measures. Entropy.
|
||
2008;10:261–273.
|
||
|
||
[22] Callen HB. Thermodynamics and an introduction to thermostatistics. New
|
||
York: John Wiley & Sons; 1985.
|
||
|
||
[23] Simon HA. Rational choice and the structure of the environment. Psycho-
|
||
logical Review. 1956;63(2):129–38.
|
||
|
||
[24] Gumbel EJ. Statistics of Extremes. New York: Columbia University Press;
|
||
1958.
|
||
|
||
[25] Ortega PA, A BD. Free Energy and the Generalized Optimality Equa-
|
||
tions for Sequential Decision Making. arXiv:12053997v1 (presented at the
|
||
European Workshop for Reinforcement Learning). 2012;.
|
||
|
||
[26] Todorov E. Linearly solvable Markov decision problems. In: Advances in
|
||
Neural Information Processing Systems. vol. 19; 2006. p. 1369–1376.
|
||
|
||
[27] Todorov E. Efficient computation of optimal actions. Proceedings of the
|
||
National Academy of Sciences USA. 2009;106:11478–11483.
|
||
|
||
[28] Braun DA, Ortega PA. Path integral control and bounded rationality. In:
|
||
IEEE Symposium on adaptive dynamic programming and reinforcement
|
||
learning; 2011. p. 202–209.
|
||
|
||
[29] Ortega PA, Braun DA. Information, utility and bounded rationality. In:
|
||
Lecture notes on artificial intelligence. vol. 6830; 2011. p. 269–274.
|
||
|
||
[30] Kappen HJ. A linear theory for control of non-linear stochastic systems.
|
||
Physical Review Letters. 2005;95:200201.
|
||
|
||
[31] Theodorou E, Buchli J, Schaal S.
|
||
A generalized path integral ap-
|
||
proach to reinforcement learning. Journal of Machine Learning Research.
|
||
2010;11:3137–3181.
|
||
|
||
[32] Bellman RE. Dynamic Programming. Princeton, NJ: Princeton University
|
||
Press; 1957.
|
||
|
||
22
|
||
|
||
|
||
[33] Whittle P.
|
||
Risk-sensitive optimal control. New York: John Wiley and
|
||
Sons; 1990.
|
||
|
||
[34] van den Broek JL, Wiegerinck WAJJ, Kappen HJ.
|
||
Risk-sensitive path
|
||
integral control. In: UAI. vol. 6; 2010. p. 1–8.
|
||
|
||
[35] Markowitz H. Portfolio Selection. The Journal of Finance. 1952;7:77–91.
|
||
|
||
[36] Ba¸sar T, Bernhard P.
|
||
H-infinity optimal control and related minimax-
|
||
design problems: a dynamic game approach. Boston: Birkh¨auser; 1991.
|
||
|
||
[37] Hansen LP, Sargent TJ.
|
||
Robustness.
|
||
Princeton: Princeton University
|
||
Press; 2008.
|
||
|
||
[38] Jacobson D. Optimal stochastic linear systems with exponential perfor-
|
||
mance criteria and their relation to deterministic differential games. IEEE
|
||
T Automat Contr AC. 1973;18:124–131.
|
||
|
||
[39] Glover K, Boyle J. State-space formulae for all stabilizing controllers that
|
||
satisfy an H-norm bound and relations to risk sensitivity. Syst Control
|
||
Lett. 1988;11:167–172.
|
||
|
||
[40] Lipman B. Information Processing and Bounded Rationality: A Survey.
|
||
Canadian Journal of Economics. 1995;28(1):42–67.
|
||
|
||
[41] Russell SJ. Rationality and Intelligence. In: Mellish C, editor. Proceedings
|
||
of the Fourteenth International Joint Conference on Artificial Intelligence.
|
||
San Francisco: Morgan Kaufmann; 1995. p. 950–957.
|
||
|
||
[42] Russell SJ, Subramanian D. Provably bounded-optimal agents. Journal of
|
||
Artificial Intelligence Research. 1995;3:575–609.
|
||
|
||
[43] Aumann RJ. Rationality and Bounded Rationality. Games and Economic
|
||
Behavior. 1997 October;21(1-2):2–14.
|
||
|
||
[44] Kahneman D. Maps of Bounded Rationality: Psychology for Behavioral
|
||
Economics. American Economic Review. 2003 December;93(5):1449–1475.
|
||
|
||
[45] Spiegler R.
|
||
Bounded Rationality and Industrial Organization.
|
||
Oxford:
|
||
Oxford University Press; 2011.
|
||
|
||
[46] Camerer C. Behavioral Game Theory: Experiments in Strategic Interac-
|
||
tion. Princeton: Princeton University Press; 2003.
|
||
|
||
[47] D MR, R PT.
|
||
Quantal Response Equilibria for Normal Form Games.
|
||
Games and Economic Behavior. 1995 July;10(1):6–38.
|
||
|
||
[48] Mckelvey R, Palfrey TR. Quantal Response Equilibria for Extensive Form
|
||
Games. Experimental Economics. 1998;1:9–41.
|
||
|
||
[49] Anderson SP, Goeree JK, Holt CA. The logit equilibrium: a perspective on
|
||
intuitive behavioral anomalies. Southern Economic Journal. 2002;69:21–47.
|
||
|
||
23
|
||
|
||
|
||
[50] Wolpert DH, Harre M, Bertschinger N, Olbrich E, Jost J. Hysteresis ef-
|
||
fects of changing parameters of noncooperative games. Physical Review E.
|
||
2012;85:036102.
|
||
|
||
[51] Peters J, M¨ulling K, Altun Y. Relative entropy policy search. In: AAAI;
|
||
2010. .
|
||
|
||
[52] Kappen HJ, G´omez V, Opper M. Optimal control as a graphical model
|
||
inference problem. Machine Learning. 2012;1:1–11.
|
||
|
||
[53] Saridis G. Entropy formulation for optimal and adaptive control. IEEE
|
||
Transactions on Automatic Control. 1988;33:713–721.
|
||
|
||
[54] Still S. An information-theoretic approach to interactive learning. Euro-
|
||
physics Letters. 2009;85:28005.
|
||
|
||
[55] Wolpert DH. Information theory - the bridge connecting bounded rational
|
||
game theory and statistical physics. In: Braha D, Bar-Yam Y, editors.
|
||
Complex Engineering Systems. Perseus Books; 2004. .
|
||
|
||
[56] Stone LD. Theory of optimal search. New York: Academic Press; 1998.
|
||
|
||
[57] Jaynes ET. Entropy and search theory. In: Smith CR, Grandy WT, editors.
|
||
Maximum entropy and Bayesian methods in inverse problems. Dordrecht:
|
||
Reidel; 1985. .
|
||
|
||
[58] Tishby N, Polani D. Information Theory of Decisions and Actions. In: Vas-
|
||
silis T Hussain, editor. Perception-reason-action cycle: Models, algorithms
|
||
and systems. Berlin: Springer; 2011. .
|
||
|
||
[59] Vitanyi PMB. Time, space, and energy in reversible computing. In: Pro-
|
||
ceedings of the 2nd ACM conference on Computing frontiers; 2005. p. 435–
|
||
444.
|
||
|
||
[60] Maccheroni F, Marinacci M, Rustichini A.
|
||
Ambiguity aversion, robust-
|
||
ness, and the variational representation of preferences.
|
||
Econometrica.
|
||
2006;74:1447–1498.
|
||
|
||
[61] Ellsberg D. Risk, Ambiguity and the Savage Axioms. The Quaterly Journal
|
||
of Economics. 1961;75:643–669.
|
||
|
||
[62] Allais M. Le comportment de l’homme rationnel devant la risque: critique
|
||
des postulats et axiomes de l’ecole Americaine. Econometrica. 1953;21:503–
|
||
546.
|
||
|
||
[63] Kahneman D, Tversky A. Prospect Theory: An Analysis of Decision under
|
||
Risk. Econometrica. 1979;47:263–291.
|
||
|
||
[64] Tversky A, Kahneman D. Advances in prospect theory: Cumulative repre-
|
||
sentation of uncertainty”. Journal of Risk and Uncertainty. 1992;5:297–323.
|
||
|
||
24
|
||
|
||
|
||
[65] Nagumo M. ¨Uber eine Klasse der Mittelwerte. Japan Journal of Mathe-
|
||
matics. 1930;7:71–79.
|
||
|
||
[66] Kolmogorov A. Sur la notion de la moyenne. Rendiconti accademia dei
|
||
lincei. 1930;12:388–391.
|
||
|
||
[67] de Finetti B. Sul concetto di media. Giornale dell’ istituto italiano degli
|
||
attuari. 1931;2:369–396.
|
||
|
||
[68] Hong CS. A generalization of the quasilinear mean with application to the
|
||
measurement of income inequality and decision theory resolving the Allais
|
||
paradox. Econometrica. 1983;51:1065–1092.
|
||
|
||
[69] Luce RD. Utility of gains and losses: measurement-theoretical and experi-
|
||
mental approaches. Mahwah, NJ: Erlbaum; 2000.
|
||
|
||
[70] Train KE. Discrete Choice Methods with Simulation. 2nd ed. Cambridge:
|
||
Cambridge University Press; 2009.
|
||
|
||
[71] Luce RD. Individual choice behavior. Oxford: Wiley; 1959.
|
||
|
||
[72] McFadden D. Conditional logit analysis of qualitative choice behavior. In:
|
||
Zarembka P, editor. Frontiers in econometrics. New York: Academic Press;
|
||
1974. .
|
||
|
||
[73] Meginnis JR. A new class of symmetric utility rules for gambles, subjective
|
||
marginal probability functions, and a generalized Bayes rule. Proceedings
|
||
of the American Statistical Association, Business and Economic Statistics
|
||
Section. 1976;p. 471–476.
|
||
|
||
[74] Fudenberg D, Kreps D. Learning mixed equilibria. Games and Economic
|
||
Behavior. 1993;5:320–367.
|
||
|
||
[75] Lee R, Wolpert DH. Game-Theoretic Modeling of Human Behavior in Mid-
|
||
Air Collisions. In: T Guy MK, Wolpert DH, editors. Decision-Making with
|
||
Imperfect Decision Makers. Springer; 2011. .
|
||
|
||
[76] Sutton RS, Barto AG. Reinforcement Learning: An Introduction. Cam-
|
||
bridge, MA: MIT Press; 1998.
|
||
|
||
[77] Rieskamp J. The probabilistic nature of preferential choice. Journal of Ex-
|
||
perimental Psychology: Learning, Memory, and Cognition. 2008;34:1446–
|
||
1465.
|
||
|
||
[78] Glscher J, Daw N, Dayan P, O’Doherty JP. States versus rewards: dissocia-
|
||
ble neural prediction error signals underlying model-based and model-free
|
||
reinforcement learning. Neuron. 2010;66(4):585–95.
|
||
|
||
[79] McDannald MA, Takahashi YK, Lopatina N, Pietras BW, Jones JL,
|
||
Schoenbaum G.
|
||
Model-based learning and the contribution of the or-
|
||
bitofrontal cortex to the model-free world. Eur J Neurosci. 2012;35(7):991–
|
||
6.
|
||
|
||
25
|
||
|
||
|
||
[80] Busemeyer JR, Diederich A. Survey of decision field theory. Mathematical
|
||
Social Sciences. 2002;43(3):345–370.
|
||
|
||
[81] Bogacz R, Brown E, Moehlis J, Holmes P, Cohen JD.
|
||
The physics of
|
||
optimal decision making: a formal analysis of models of performance in
|
||
two-alternative forced-choice tasks. Psychological Review. 2006;113:700–
|
||
765.
|
||
|
||
[82] Friston K. The free-energy principle: a rough guide to the brain? Trends
|
||
in Cognitive Science. 2009;13:293–301.
|
||
|
||
[83] Friston K.
|
||
The free-energy principle: a unified brain theory?
|
||
Nature
|
||
Review Neuroscience. 2010;11:127–138.
|
||
|
||
[84] Ortega PA, Braun DA. A minimum relative entropy principle for learning
|
||
and acting. Journal of Artificial Intelligence Research. 2010;38:475–511.
|
||
|
||
[85] Shahlizi CR.
|
||
Dynamics of bayesian updating with dependent data and
|
||
misspecified models. Electronic Journal of Statistics. 2009;3:1039–1074.
|
||
|
||
26
|
||
|
||
|