                                                                                                                                                            REVIEWS
                                              The free-energy principle:  
                                              a unified brain theory?
                                              Karl Friston
                                              Abstract | A free-energy principle has been proposed recently that accounts for action, 
                                              perception and learning. This Review looks at some key brain theories in the biological (for 
                                              example, neural Darwinism) and physical (for example, information theory and optimal 
                                              control theory) sciences from the free-energy perspective. Crucially, one key theme runs 
                                              through each of these theories — optimization. Furthermore, if we look closely at what is 
                                              optimized, the same quantity keeps emerging, namely value (expected reward, expected 
                                              utility) or its complement, surprise (prediction error, expected cost). This is the quantity that 
                                              is optimized under the free-energy principle, which suggests that several global brain 
                                              theories might be unified within a free-energy framework.
           Free energy                      Despite the wealth of empirical data in neuroscience,                  Motivation: resisting a tendency to disorder. The 
           An information theory measure    there are relatively few global theories about how the                 defining characteristic of biological systems is that 
           that bounds or limits (by being  brain works. A recently proposed free-energy principle                 they maintain their states and form in the face of a 
           greater than) the surprise on    for adaptive systems tries to provide a unified account                constantly changing environment3–6. From the point 
           sampling some data, given a      of action, perception and learning. Although this prin-                of view of the brain, the environment includes both 
           generative model.                ciple has been portrayed as a unified brain theory1, its               the external and the internal milieu. This maintenance 
           Homeostasis                      capacity to unify different perspectives on brain function             of order is seen at many levels and distinguishes bio-
           The process whereby an open      has yet to be established. This Review attempts to place               logical from other self-organizing systems; indeed, the 
           or closed system regulates its   some key theories within the free-energy framework, in                 physiology of biological systems can be reduced almost 
           internal environment to          the hope of identifying common themes. I first review                  entirely to their homeostasis7. More precisely, the rep-
           maintain its states within       the free-energy principle and then deconstruct several                 ertoire of physiological and sensory states in which an 
           bounds.                          global brain theories to show how they all speak to the                organism can be is limited, and these states define the 
           Entropy                          same underlying idea.                                                  organism's phenotype. Mathematically, this means that 
           The average surprise of                                                                                 the probability of these (interoceptive and exterocep-
           outcomes sampled from a          The free-energy principle                                              tive) sensory states must have low entropy; in other 
           probability distribution or      The free-energy principle (BOX 1) says that any self-                  words, there is a high probability that a system will 
           density. A density with low 
           entropy means that, on           organizing system that is at equilibrium with its environ-             be in any of a small number of states, and a low prob-
           average, the outcome is          ment must minimize its free energy2. The principle is                  ability that it will be in the remaining states. Entropy 
           relatively predictable. Entropy  essentially a mathematical formulation of how adaptive                 is also the average self information or surprise8  
           is therefore a measure of        systems (that is, biological agents, like animals or brains)           (more formally, it is the negative log-probability of an 
           uncertainty.
                                            resist a natural tendency to disorder3–6. What follows is             outcome). Here, a fish out of water would be in a sur-
                                            a non-mathematical treatment of the motivation and                     prising state (both emotionally and mathematically). 
                                            implications of the principle. We will see that although the           A fish that frequently forsook water would have high 
                                            motivation is quite straightforward, the implications are              entropy. Note that both surprise and entropy depend 
           The Wellcome Trust Centre        complicated and diverse. This diversity allows the prin-               on the agent: what is surprising for one agent (for 
           for Neuroimaging,                ciple to account for many aspects of brain structure and               example, being out of water) may not be surprising 
           University College London,       function and lends it the potential to unify different per-            for another. Biological agents must therefore mini-
           Queen Square, London, 
           WC1N 3BG, UK.                    spectives on how the brain works. In subsequent sections,              mize the long-term average of surprise to ensure that 
           email:                          I discuss how the principle can be applied to neuronal                 their sensory entropy remains low. In other words,  
           k.friston@fil.ion.ucl.ac.uk      systems as viewed from these perspectives. This Review                 biological systems somehow manage to violate the  
           doi:10.1038/nrn2787              starts in a rather abstract and technical way but then tries           fluctuation theorem, which generalizes the second law 
           Published online  
           13 January 2010
                                            to unpack the basic idea in more familiar terms.                       of thermodynamics9.
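
The link between surprise and entropy described under "Motivation" can be illustrated with a minimal sketch. The two state distributions below are hypothetical numbers chosen only for illustration, and the function names `surprise` and `entropy` are assumptions of this example rather than anything defined in the Review. Surprise is computed as the negative log-probability of an outcome, and entropy as the average surprise over outcomes.

```python
import math


def surprise(p):
    """Surprise (self-information) of an outcome with probability p, in nats."""
    return -math.log(p)


def entropy(dist):
    """Entropy as the average surprise over a distribution {state: probability}."""
    return sum(p * surprise(p) for p in dist.values() if p > 0)


# Hypothetical distributions over states (illustrative numbers only).
typical_fish = {"in_water": 0.999, "out_of_water": 0.001}
wandering_fish = {"in_water": 0.5, "out_of_water": 0.5}

# Being out of water is a high-surprise state for the typical fish.
print("surprise, out of water:", surprise(typical_fish["out_of_water"]))
print("surprise, in water:    ", surprise(typical_fish["in_water"]))

# A fish that frequently leaves the water has higher entropy.
print("entropy, typical fish:  ", entropy(typical_fish))
print("entropy, wandering fish:", entropy(wandering_fish))
```

Under these assumed numbers, the fish that spends half its time out of water has the larger entropy, which is the sense in which keeping the long-term average of surprise low keeps sensory entropy low.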
           NATURE REVIEWS | NEUROSCIENCE                                                                                                     VOLUME 11 | FEBRUARY 2010 | 127
                                                                    © 2010 Macmillan Publishers Limited. All rights reserved
                                                            
                Box 1 | The free-energy principle

                Part a of the figure shows the dependencies among the quantities that define free energy. These include the
                internal states of the brain $\mu(t)$ and quantities describing its exchange with the environment: sensory
                signals (and their motion) $\tilde{s}(t) = [s, s', s'', \ldots]^{T}$ plus action $a(t)$. The environment is
                described by equations of motion, which specify the trajectory of its hidden states. The causes
                $\vartheta \supset \{\tilde{x}, \theta, \gamma\}$ of sensory input comprise hidden states $\tilde{x}(t)$,
                parameters $\theta$ and precisions $\gamma$ controlling the amplitude of the random fluctuations
                $\tilde{z}(t)$ and $\tilde{w}(t)$. Internal brain states and action minimize free energy
                $F(\tilde{s}, \mu)$, which is a function of sensory input and a probabilistic representation
                $q(\vartheta \mid \mu)$ of its causes. This representation is called the recognition density and is
                encoded by internal states $\mu$.

                The free energy depends on two probability densities: the recognition density $q(\vartheta \mid \mu)$ and
                one that generates sensory samples and their causes, $p(\tilde{s}, \vartheta \mid m)$. The latter
                represents a probabilistic generative model (denoted by $m$), the form of which is entailed by the agent
                or brain. Part b of the figure provides alternative expressions for the free energy to show what its
                minimization entails: action can reduce free energy only by increasing accuracy (that is, selectively
                sampling data that are predicted). Conversely, optimizing brain states makes the representation an
                approximate conditional density on the causes of sensory input. This enables action to avoid surprising
                sensory encounters. A more formal description is provided below.

                [Figure, part a: Environment and Agent]
                  Sensations:                 $\tilde{s} = g(\tilde{x}, \vartheta) + \tilde{z}$
                  External states:            $\dot{\tilde{x}} = f(\tilde{x}, a, \vartheta) + \tilde{w}$
                  Internal states:            $\mu = \arg\min F(\tilde{s}, \mu)$
                  Action or control signals:  $a = \arg\min F(\tilde{s}, \mu)$

                [Figure, part b]
                  Free-energy bound on surprise:
                    $F = -\langle \ln p(\tilde{s}, \vartheta \mid m) \rangle_q + \langle \ln q(\vartheta \mid \mu) \rangle_q$
                  Action minimizes prediction errors:
                    $F = D(q(\vartheta \mid \mu) \,\|\, p(\vartheta)) - \langle \ln p(\tilde{s}(a) \mid \vartheta, m) \rangle_q$
                    $a = \arg\max \text{Accuracy}$
                  Perception optimizes predictions:
                    $F = D(q(\vartheta \mid \mu) \,\|\, p(\vartheta \mid \tilde{s})) - \ln p(\tilde{s} \mid m)$
                    $\mu = \arg\min \text{Divergence}$
                  (Figure credit: Nature Reviews | Neuroscience)

                Optimizing the sufficient statistics (representations)
                Optimizing the recognition density makes it a posterior or conditional density on the causes of sensory
                data: this can be seen by expressing the free energy as surprise $-\ln p(\tilde{s} \mid m)$ plus a
                Kullback-Leibler divergence between the recognition and conditional densities (encoded by the internal
                states in the figure). Because this difference is always non-negative, minimizing free energy makes the
                recognition density an approximate posterior probability. This means the agent implicitly infers or
                represents the causes of its sensory samples in a Bayes-optimal fashion. At the same time, the free
                energy becomes a tight bound on surprise, which is minimized through action.

                Optimizing action
                Acting on the environment by minimizing free energy enforces a sampling of sensory data that is
                consistent with the current representation. This can be seen with a second rearrangement of the free
                energy as a mixture of accuracy and complexity. Crucially, action can only affect accuracy (encoded by
                the external states in the figure). This means that the brain will reconfigure its sensory epithelia to
                sample inputs that are predicted by the recognition density; in other words, to minimize prediction error.

                Surprise
                (Surprisal or self information.)
                The negative log-probability of
                an outcome. An improbable
                outcome (for example, water
                flowing uphill) is therefore
                surprising.
                Fluctuation theorem
                (A term from statistical
                mechanics.) Deals with the
                probability that the entropy
                of a system that is far from the
                thermodynamic equilibrium
                will increase or decrease over
                a given amount of time. It
                states that the probability of
                the entropy decreasing
                becomes exponentially smaller
                with time.
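
The alternative expressions for free energy in part b of the Box 1 figure can be checked numerically on a toy problem. The sketch below is a minimal illustration under stated assumptions, not the Review's own implementation: it assumes a single discrete cause with two possible values, a fixed observed sensation, and hypothetical numbers for the prior and likelihood. The helper names `kl` and `free_energy_forms` are inventions of this example. Accuracy is taken to be the expected log-likelihood under the recognition density, as in the figure.

```python
import math


def kl(q, p):
    """Kullback-Leibler divergence D(q || p) between two discrete densities."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)


def free_energy_forms(q, prior, lik):
    """Evaluate free energy in three equivalent ways for one observed sensation.

    q     : recognition density over the causes
    prior : prior density over the causes
    lik   : likelihood of the observed sensation under each cause
    """
    joint = [l * p for l, p in zip(lik, prior)]      # p(s, cause)
    evidence = sum(joint)                            # p(s)
    posterior = [j / evidence for j in joint]        # p(cause | s)
    surprise = -math.log(evidence)                   # -ln p(s)

    energy = -sum(qi * math.log(ji) for qi, ji in zip(q, joint))
    entropy = -sum(qi * math.log(qi) for qi in q if qi > 0)
    accuracy = sum(qi * math.log(li) for qi, li in zip(q, lik))

    return {
        "energy_minus_entropy": energy - entropy,
        "surprise_plus_divergence": surprise + kl(q, posterior),
        "complexity_minus_accuracy": kl(q, prior) - accuracy,
        "surprise": surprise,
        "posterior": posterior,
    }


# Hypothetical prior and likelihood (illustrative numbers only).
prior = [0.7, 0.3]
lik = [0.2, 0.9]

rough = free_energy_forms([0.5, 0.5], prior, lik)         # arbitrary recognition density
exact = free_energy_forms(rough["posterior"], prior, lik)  # recognition density = posterior

for name in ("energy_minus_entropy", "surprise_plus_divergence", "complexity_minus_accuracy"):
    print(name, rough[name], exact[name])
print("surprise", rough["surprise"])
```

In this toy setting the three expressions agree for any recognition density, free energy stays above surprise, and the bound becomes tight when the recognition density equals the posterior.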
               Attractor                                      In short, the long-term (distal) imperative — of main-                                Crucially, free energy can be evaluated because it is a 
               A set to which a dynamical                taining states within physiological bounds — translates                                    function of two things to which the agent has access: its 
               system evolves after a long               into a short-term (proximal) avoidance of surprise.                                        sensory states and a recognition density that is encoded 
                enough time. Points that                  Surprise here relates not just to the current state, which                                 by its internal states (for example, neuronal activity 
               get close to the attractor                cannot be changed, but also to movement from one state                                     and connection strengths). The recognition density is a 
               remain close, even under                  to another, which can change. This motion can be com-                                      probabilistic representation of what caused a particular 
               small perturbations.
                                                         plicated and itinerant (wandering) provided that it revis-                                 sensation.
               Kullback-Leibler divergence               its a small set of states, called a global random attractor10,                                  This (variational) free-energy construct was  
               (Or information divergence,               that are compatible with survival (for example, driving a                                  introduced into statistical physics to convert difficult  
                information gain or cross                 car within a small margin of error). It is this motion that                                probability-density integration problems into eas-
                entropy.) A non-commutative 
                measure of the non-negative               the free-energy principle optimizes.                                                       ier optimization problems11. It is an information  
                difference between two                         So far, all we have said is that biological agents must                               theoretic quantity (like surprise), as opposed to a 
                probability distributions.                avoid surprises to ensure that their states remain within                                  thermodynamic quantity. Variational free energy has 
                Recognition density                       physiological bounds (see Supplementary information S1                                     been exploited in machine learning and statistics to 
                (Or approximating conditional            (box) for a more formal argument). But how do they                                         solve many inference and learning problems12–14. In this 
                density.) An approximate                 do this? A system cannot know whether its sensations                                       setting, surprise is called the (negative) model evidence. 
                probability distribution of the           are surprising and could not avoid them even if it did                                     This means that minimizing surprise is the same as 
                causes of data (for example,              know. This is where free energy comes in: free energy is                                   maximizing the sensory evidence for an agent's exist-
                sensory input). It is the product         an upper bound on surprise, which means that if agents                                     ence, if we regard the agent as a model of its world. In 
               of inference or inverting a 
               generative model.                         minimize free energy, they implicitly minimize surprise.                                   the present context, free energy provides the answer to 
                                                  a fundamental question: how do self-organizing adap-                               In summary, the free energy rests on a model of how 
                                                  tive systems avoid surprising states? They can do this by                      sensory data are generated and on a recognition density 
                                                   minimizing their free energy. So what does this involve?                       on the model's parameters (that is, sensory causes). Free 
                                                                                                                                 energy can be reduced only by changing the recognition 
                                                  Implications: action and perception. Agents can                                density to change conditional expectations about what is 
                                                  suppress free energy by changing the two things it depends                     sampled or by changing sensory samples (that is, sensory 
            Generative model                      on: they can change sensory input by acting on the world                       input) so that they conform to expectations. In what fol-
            A probabilistic model (joint          or they can change their recognition density by chang-                         lows, I consider these implications in light of some key 
            density) of the dependencies          ing their internal states. This distinction maps nicely                        theories about the brain.
             between causes and                    onto action and perception (BOX 1). One can see what this 
             consequences (data), from             means in more detail by considering three mathematically                       The Bayesian brain hypothesis
             which samples can be 
             generated. It is usually
                                                   equivalent formulations of free energy (see Supplementary                      The Bayesian brain hypothesis17 uses Bayesian probability 
             specified in terms of the             information S2 (box) for a mathematical treatment).                            theory to formulate perception as a constructive process 
             likelihood of data, given their           The first formulation expresses free energy as energy                      based on internal or generative models. The underlying 
             causes (parameters of a model)
             and priors on the causes.             minus entropy. This formulation is important for three                         idea is that the brain has a model of the world18–22 that 
                                                   reasons. First, it connects the concept of free energy as                      it tries to optimize using sensory inputs23–28. This idea is 
             Conditional density
                                                   used in information theory with concepts used in sta-                          related to analysis by synthesis20 and epistemological autom-
             (Or posterior density.) The
             probability distribution of           tistical thermodynamics. Second, it shows that the free                        ata19. In this view, the brain is an inference machine that 
             causes or model parameters,           energy can be evaluated by an agent because the energy                         actively predicts and explains its sensations18,22,25. Central 
             given some data; that is, a           is the surprise about the joint occurrence of sensations                       to this hypothesis is a probabilistic model that can gener-
             probabilistic mapping from            and their perceived causes, whereas the entropy is sim-                        ate predictions, against which sensory samples are tested 
             observed data to causes.              ply that of the agent's own recognition density. Third, it                     to update beliefs about their causes. This generative 
            Prior                                 shows that free energy rests on a generative model of the                      model is decomposed into a likelihood (the probability of 
            The probability distribution or       world, which is expressed in terms of the probability of a                     sensory data, given their causes) and a prior (the a priori 
            density of the causes of data         sensation and its causes occurring together. This means                        probability of those causes). Perception then becomes the 
            that encodes beliefs about            that an agent must have an implicit generative model of                        process of inverting the likelihood model (mapping from 
            those causes before observing         how causes conspire to produce sensory data. It is this                        causes to sensations) to access the posterior probability of 
            the data.                             model that defines both the nature of the agent and the                        the causes, given sensory data (mapping from sensations 
            Bayesian surprise                     quality of the free-energy bound on surprise.                                  to causes). This inversion is the same as minimizing the 
             A measure of salience based               The second formulation expresses free energy as                            difference between the recognition and posterior densi-
             on the Kullback-Leibler               surprise plus a divergence term. The (perceptual) diver-                       ties to suppress free energy. Indeed, the free-energy for-
             divergence between the                gence is just the difference between the recognition den-                      mulation was developed to finesse the difficult problem 
             recognition density (which            sity and the conditional density (or posterior density) of the                 of exact inference by converting it into an easier optimi-
             encodes posterior beliefs) and 
             the prior density. It                 causes of a sensation, given the sensory signals. This con-                    zation problem11–14. This has furnished some powerful 
             measures the information that         ditional density represents the best possible guess about                      approximation techniques for model identification and 
            can be recognized in the data.        the true causes. The difference between the two densities                      comparison (for example, variational Bayes or ensemble 
            Bayesian brain hypothesis             is always non-negative and free energy is therefore an                         learning29). There are many interesting issues that attend 
            The idea that the brain uses          upper bound on surprise. Thus, minimizing free energy                          the Bayesian brain hypothesis, which can be illuminated 
            internal probabilistic                by changing the recognition density (without changing                          by the free-energy principle; we will focus on two.
            (generative) models to update         sensory data) reduces the perceptual divergence, so that                           The first is the form of the generative model and 
             posterior beliefs, using sensory      the recognition density becomes the conditional density                        how it manifests in the brain. One criticism of Bayesian 
            information, in an                    and the free energy becomes surprise.                                          treatments is that they ignore the question of how prior 
            (approximately) Bayes-optimal 
            fashion.                                  The third formulation expresses free energy as com-                        beliefs, which are necessary for inference, are formed27. 
            Analysis by synthesis                 plexity minus accuracy, using terms from the model                             However, this criticism dissolves with hierarchical 
            Any strategy (in speech coding)       comparison literature. Complexity is the difference                            generative models, in which the priors themselves are 
             in which the parameters of a          between the recognition density and the prior density                          optimized26,28. In hierarchical models, causes in one 
             signal coder are evaluated by         on causes; it is also known as Bayesian surprise15 and is the                  level generate subordinate causes in a lower level; sen-
             decoding (synthesizing) the           difference between the prior density, which encodes                            sory data per se are generated at the lowest level (BOX 2). 
             signal and comparing it with          beliefs about the state of the world before sensory data are                   Minimizing the free energy effectively optimizes empiri-
             the original input signal.            assimilated, and posterior beliefs, which are encoded                          cal priors (that is, the probability of causes at one level, 
            Epistemological automata              by the recognition density. Accuracy is simply the sur-                        given those in the level above). Crucially, because empir-
            Possibly the first theory for why     prise about sensations that are expected under the recog-                      ical priors are linked hierarchically, they are informed 
            top-down influences (mediated         nition density. This formulation shows that minimizing                         by sensory data, enabling the brain to optimize its prior 
            by backward connections in            free energy by changing sensory data (without changing                         expectations online. This optimization makes every level 
            the brain) might be important         the recognition density) must increase the accuracy of                         in the hierarchy accountable to the others, furnishing an 
            in perception and cognition.          an agents predictions. In short, the agent will selectively                   internally consistent representation of sensory causes at 
            Empirical prior                       sample the sensory inputs that it expects. This is known                       multiple levels of description. Not only do hierarchical 
            A prior induced by hierarchical                               16
            models; empirical priors              as active inference . An intuitive example of this process                     models have a key role in statistics (for example, ran-
                                                  (when it is raised into consciousness) would be feeling                        dom effects and parametric empirical Bayes models30,31), 
            provide constraints on the            our way in darkness: we anticipate what we might touch                         they may also be used by the brain, given the hierarchical 
            recognition density in the usual 
                                                                                                                                                                                   3234
            way but depend on the data.           next and then try to confirm those expectations.                               arrangement of cortical sensory areas                  .
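The formulation of free energy as complexity minus accuracy, described above, can be illustrated with a toy discrete model. The sketch below is not taken from the article: the two-cause model, its numbers and the function names (`kl_divergence`, `free_energy`) are assumptions chosen purely for illustration. It evaluates free energy for two choices of recognition density and compares each with surprise, which free energy should bound from above, with equality when the recognition density is the conditional (posterior) density.

```python
import math


def kl_divergence(q, p):
    """Kullback-Leibler divergence between two discrete densities."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)


def free_energy(q, prior, likelihood):
    """Free energy as complexity minus accuracy over a discrete set of causes.

    q          : recognition density over causes
    prior      : prior density over causes
    likelihood : p(s | cause) for the single observed sensation s
    """
    # Complexity: divergence between recognition density and prior.
    complexity = kl_divergence(q, prior)
    # Accuracy: expected log-likelihood of the sensation under q.
    accuracy = sum(qi * math.log(li) for qi, li in zip(q, likelihood) if qi > 0)
    return complexity - accuracy


# Hypothetical two-cause model; all numbers are illustrative assumptions.
prior = [0.5, 0.5]
likelihood = [0.9, 0.2]

evidence = sum(p * l for p, l in zip(prior, likelihood))
posterior = [p * l / evidence for p, l in zip(prior, likelihood)]
surprise = -math.log(evidence)

f_prior = free_energy(prior, prior, likelihood)          # q set to the prior
f_posterior = free_energy(posterior, prior, likelihood)  # q set to the posterior

print(f"surprise               = {surprise:.4f}")
print(f"F with q = prior       = {f_prior:.4f}")
print(f"F with q = posterior   = {f_posterior:.4f}")
```

In this toy setting, changing the recognition density from the prior to the posterior removes the divergence term, so that free energy falls to surprise; it cannot go lower without changing the sensation itself.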
Box 2 | Hierarchical message passing in the brain

[Figure: hierarchical message passing between lower and higher cortical areas. Sensory input $\tilde{s}(t)$ enters at the lowest level. Forward connections convey prediction error ($\xi_v^{(i)}$, $\xi_x^{(i)}$) and backward connections convey predictions based on the conditional expectations ($\mu_v^{(i)}$, $\mu_x^{(i)}$).]

Prediction errors:

$$\xi_v^{(i)} = \Pi_v^{(i)} \varepsilon_v^{(i)} = \Pi_v^{(i)} \left( \mu_v^{(i-1)} - g(\mu^{(i)}) \right)$$

$$\xi_x^{(i)} = \Pi_x^{(i)} \varepsilon_x^{(i)} = \Pi_x^{(i)} \left( D\mu_x^{(i)} - f(\mu^{(i)}) \right)$$

Recognition dynamics:

$$\dot{\mu}_v^{(i)} = D\mu_v^{(i)} - \left( \partial_v \varepsilon^{(i)} \right)^{\mathrm{T}} \xi^{(i)} - \xi_v^{(i+1)}$$

$$\dot{\mu}_x^{(i)} = D\mu_x^{(i)} - \left( \partial_x \varepsilon^{(i)} \right)^{\mathrm{T}} \xi^{(i)}$$

Synaptic plasticity:

$$\dot{\mu}_{\theta_{ij}} = -\partial_{\theta_{ij}} \varepsilon^{\mathrm{T}} \xi$$

Synaptic gain:

$$\dot{\mu}_{\gamma_i} = \tfrac{1}{2} \operatorname{tr}\left( \partial_{\gamma_i} \Pi \left( \xi \xi^{\mathrm{T}} - \Pi(\mu_\gamma) \right) \right)$$

The figure details a neuronal architecture that optimizes the conditional expectations of causes in hierarchical models of sensory input. It shows the putative cells of origin of forward driving connections that convey prediction error (grey arrows) from a lower area (for example, the lateral geniculate nucleus) to a higher area (for example, V1), and nonlinear backward connections (black arrows) that construct predictions41. These predictions try to explain away prediction error in lower levels. In this scheme, the sources of forward and backward connections are superficial and deep pyramidal cells (upper and lower triangles), respectively, where state units are black and error units are grey. The equations represent a gradient descent on free energy using the generative model below. The two upper equations describe the formation of prediction error encoded by error units, and the two lower equations represent recognition dynamics, using a gradient descent on free energy.

Generative models in the brain
To evaluate free energy one needs a generative model of how the sensorium is caused. Such models $p(\tilde{s}, \vartheta) = p(\tilde{s} \mid \vartheta)\, p(\vartheta)$ combine the likelihood $p(\tilde{s} \mid \vartheta)$ of getting some data given their causes and the prior beliefs about these causes, $p(\vartheta)$. The brain has to explain complicated dynamics on continuous states with hierarchical or deep causal structure and may use models with the following form

$$
\begin{aligned}
\tilde{s} &= g(x^{(1)}, v^{(1)}, \theta^{(1)}) + z^{(1)} \\
\dot{x}^{(1)} &= f(x^{(1)}, v^{(1)}, \theta^{(1)}) + w^{(1)} \\
&\;\;\vdots \\
v^{(i-1)} &= g(x^{(i)}, v^{(i)}, \theta^{(i)}) + z^{(i)} \\
\dot{x}^{(i)} &= f(x^{(i)}, v^{(i)}, \theta^{(i)}) + w^{(i)} \\
&\;\;\vdots
\end{aligned}
$$

Here, $g^{(i)}$ and $f^{(i)}$ are continuous nonlinear functions of (hidden and causal) states, with parameters $\theta^{(i)}$. The random fluctuations $z(t)^{(i)}$ and $w(t)^{(i)}$ play the part of observation noise at the sensory level and state noise at higher levels. Causal states $v(t)^{(i)}$ link hierarchical levels, where the output of one level provides input to the next. Hidden states $x(t)^{(i)}$ link dynamics over time and endow the model with memory. Gaussian assumptions about the random fluctuations specify the likelihood and Gaussian assumptions about state noise furnish empirical priors in terms of predicted motion. These assumptions are encoded by their precision (or inverse variance), $\Pi^{(i)}(\gamma)$, which are functions of precision parameters $\gamma$.

Recognition dynamics and prediction error
If we assume that neuronal activity encodes the conditional expectation of states, then recognition can be formulated as a gradient descent on free energy. Under Gaussian assumptions, these recognition dynamics can be expressed compactly in terms of precision-weighted prediction errors $\xi^{(i)} = \Pi^{(i)} \varepsilon^{(i)}$ on the causal states and motion of hidden states. The ensuing equations (see the figure) suggest two neuronal populations that exchange messages: causal or hidden-state units encoding expected states and error units encoding prediction error. Under hierarchical models, error units receive messages from the state units in the same level and the level above, whereas state units are driven by error units in the same level and the level below. These provide bottom-up messages that drive conditional expectations $\mu^{(i)}$ towards better predictions, which explain away prediction error. These top-down predictions correspond to $g(\mu^{(i)})$ and $f(\mu^{(i)})$. This scheme suggests that the only connections that link levels are forward connections conveying prediction error to state units and reciprocal backward connections that mediate predictions. See REFS 42,130 for details. Figure is modified from REF. 42.

[Figure 1 panels. a | Perceptual inference (vocal centre, syrinx, sonogram), with causal states $v = (v_1, v_2)$ and hidden-state dynamics

$$\dot{x} = f(x, v) = \begin{bmatrix} 18x_2 - 18x_1 \\ v_1 x_1 - 2x_3 x_1 - x_2 \\ 2x_1 x_2 - v_2 x_3 \end{bmatrix}$$

b | Perceptual categorization: sonograms of song a, song b and song c, plotted as frequency (Hz) against time (s). c | Estimated causes against time (s), and the conditional density over $(v_1, v_2)$.]

Figure 1 | Birdsongs and perceptual categorization. a | The generative model of birdsong used in this simulation comprises a Lorenz attractor with two control parameters (or causal states) $(v_1, v_2)$, which, in turn, delivers two control parameters (not shown) to a synthetic syrinx to produce chirps that were modulated in amplitude and frequency (an example is shown as a sonogram). The chirps were then presented as a stimulus to a synthetic bird to see whether it could infer the underlying causal states and thereby categorize the song. This entails minimizing free energy by changing the internal representation $(\mu_{v_1}, \mu_{v_2})$ of the control parameters. Examples of this perceptual inference or categorization are shown below. b | Three simulated songs are shown in sonogram format. Each comprises a series of chirps, the frequency and number of which fall progressively from song a to song c, as a causal state (known as the Rayleigh number; $v_1$ in part a) is decreased. c | The graph on the left depicts the conditional expectations $(\mu_{v_1}, \mu_{v_2})$ of the causal states, shown as a function of peristimulus time for the three songs. It shows that the causes are identified after around 600 ms with high conditional precision (90% confidence intervals are shown in grey). The graph on the right shows the conditional density on the causes shortly before the end of the peristimulus time (that is, the dotted line in the left panel). The blue dots correspond to conditional expectations and the grey areas correspond to the 90% conditional confidence regions. Note that these encompass the true values (red dots) of $(v_1, v_2)$ that were used to generate the songs.

The second issue is the form of the recognition density that is encoded by physical attributes of the brain, such as synaptic activity, efficacy and gain. In general, any density is encoded by its sufficient statistics (for example, the mean and variance of a Gaussian form). The way the brain encodes these statistics places important constraints on the sorts of schemes that underlie recognition: they range from free-form schemes (for example, particle filtering26 and probabilistic population codes35–38), which use a vast number of sufficient statistics, to simpler forms, which make stronger assumptions about the shape of the recognition density, so that it can be encoded with a small number of sufficient statistics. The simplest assumed form is Gaussian, which requires only the conditional mean or expectation; this is known as the Laplace assumption39, under which the free energy is just the difference between the model's predictions and the sensations or representations that are predicted. Minimizing free energy then corresponds to explaining away prediction errors. This is known as predictive coding and has become a popular framework for understanding neuronal message passing among different levels of cortical hierarchies40. In this scheme, prediction error units compare conditional expectations with top-down predictions to elaborate a prediction error. This prediction error is passed forward to drive the units in the level above that encode conditional expectations which optimize top-down predictions to explain away (reduce) prediction error in the level below. Here, explaining away just means countering excitatory bottom-up inputs to a prediction error neuron with inhibitory synaptic inputs that are driven by top-down predictions (see BOX 2 and REFS 41,42 for detailed discussion). The reciprocal exchange of bottom-up prediction errors and top-down predictions proceeds until prediction error is minimized at all levels and conditional expectations are optimized. This scheme has been invoked to explain many features of early visual responses40,43 and provides a plausible account of repetition suppression and mismatch responses in electrophysiology44. FIGURE 1 provides an example of perceptual categorization that uses this scheme.

Message passing of this sort is consistent with functional asymmetries in real cortical hierarchies45, where forward connections (which convey prediction errors) are driving and backwards connections (which model the nonlinear generation of sensory input) have both driving and modulatory characteristics46. This asymmetrical message passing is also a characteristic feature of adaptive resonance theory47,48, which has formal similarities to predictive coding.

In summary, the theme underlying the Bayesian brain and predictive coding is that the brain is an inference engine that is trying to optimize probabilistic representations of what caused its sensory input. This optimization can be finessed using a (variational free-energy) bound on surprise. In short, the free-energy principle entails the Bayesian brain hypothesis and can be implemented by the many schemes considered in this field. Almost invariably, these involve some form of message passing or belief propagation among brain areas or units. This allows us to connect the free-energy principle to another principled approach to sensory processing, namely information theory.

The principle of efficient coding
The principle of efficient coding suggests that the brain optimizes the mutual information (that is, the mutual predictability) between the sensorium and its internal representation, under constraints on the efficiency of those representations. This line of thinking was articulated by Barlow49 in terms of a redundancy reduction principle (or principle of efficient coding) and formalized later in terms of the infomax principle50. It has been applied in machine learning51, leading to methods like independent component analysis52, and in neurobiology, contributing to an understanding of the nature of neuronal responses53–56. This principle is extremely effective in predicting the empirical characteristics of classical receptive fields53 and provides a principled explanation for sparse coding55 and the segregation of processing streams in visual hierarchies57. It has been extended to cover dynamics and motion trajectories58,59 and even used to infer the metabolic constraints on neuronal processing60.

At its simplest, the infomax principle says that neuronal activity should encode sensory information in an efficient and parsimonious fashion. It considers the mapping between one set of variables (sensory states) and another (variables representing those states). At first glance, this seems to preclude a probabilistic representation, because this would involve mapping between sensory states and a probability density. However, the infomax principle can be applied to the sufficient statistics of a recognition density. In this context, the infomax principle becomes a special case of the free-energy principle, which arises when we ignore uncertainty in probabilistic representations (and when there is no action; see Supplementary information S3 (box) for mathematical details). This is easy to see by noting that sensory signals are generated by causes. This means that it is sufficient to represent the causes to predict these signals. More formally, the infomax principle can be understood in terms of the decomposition of free energy into complexity and accuracy: mutual information is optimized when conditional expectations maximize accuracy (or minimize prediction error), and efficiency is assured by minimizing complexity. This ensures that no excessive parameters are applied in the generative model and leads to a parsimonious representation of sensory data that conforms to prior constraints on their causes. Interestingly, advanced model-optimization techniques use free-energy optimization to eliminate redundant model parameters61, suggesting that free-energy optimization might provide a nice explanation for the synaptic pruning and homeostasis that take place in the brain during neurodevelopment62 and sleep63.
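The predictive coding scheme described earlier (prediction errors passed forward, predictions passed backward, both weighted by precision) can be sketched for the simplest possible case. The example below is an illustrative assumption and not the scheme of BOX 2 itself: it uses a single static cause, a linear mapping `theta * mu` in place of the nonlinear function g, no hidden states or generalized motion, and hypothetical names (`predictive_coding`, `pi_s`, `pi_v`). Under these assumptions the free energy is a sum of two precision-weighted squared prediction errors, and recognition is a gradient descent on that sum.

```python
def predictive_coding(s, theta, pi_s, v_prior, pi_v, rate=0.05, steps=2000):
    """Gradient descent on free energy for one static cause (toy sketch).

    s       : sensory input
    theta   : parameter of the assumed linear prediction g(mu) = theta * mu
    pi_s    : precision of sensory noise
    v_prior : top-down prediction of the cause (prior expectation)
    pi_v    : precision of the prior
    Returns the conditional expectation and the free-energy trace.
    """
    mu = v_prior
    trace = []
    for _ in range(steps):
        eps_s = s - theta * mu   # sensory prediction error (lower level)
        eps_v = mu - v_prior     # prediction error on the cause (level above)
        xi_s = pi_s * eps_s      # precision-weighted errors
        xi_v = pi_v * eps_v
        trace.append(0.5 * (pi_s * eps_s ** 2 + pi_v * eps_v ** 2))
        # The state unit is driven by the error from the level below and
        # suppressed by the error with respect to the top-down prediction.
        mu += rate * (theta * xi_s - xi_v)
    return mu, trace


# Illustrative numbers only.
mu_hat, free_energy_trace = predictive_coding(
    s=3.0, theta=2.0, pi_s=1.0, v_prior=0.0, pi_v=1.0
)
print(f"conditional expectation = {mu_hat:.4f}")
print(f"free energy: start {free_energy_trace[0]:.4f}, end {free_energy_trace[-1]:.4f}")
```

For this linear Gaussian case the descent settles on the precision-weighted compromise between the sensory evidence and the prior, which is the exact posterior mean; with nonlinear predictions the same update would only find a local optimum.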
These results illustrate the nature of perceptual categorization under the inference scheme in BOX 2: here, recognition corresponds to mapping from a continuously changing and chaotic sensory input to a fixed point in perceptual space. Figure is reproduced, with permission, from REF. 130 © (2009) Elsevier.

The infomax principle pertains to a forward mapping from sensory input to representations. How does this square with optimizing generative models, which map from causes to sensory inputs? These perspectives can be reconciled by noting that all recognition schemes based on infomax can be cast as optimizing the parameters of a generative model64. For example, in sparse coding models55, the implicit priors posit independent causes that are sampled from a heavy-tailed or sparse distribution42. The fact that these models predict empirically observed receptive fields so well suggests that we are endowed with (or acquire) prior expectations that the causes of our sensations are largely independent and sparse.

In summary, the principle of efficient coding says that the brain should optimize the mutual information between its sensory signals and some parsimonious neuronal representations. This is the same as optimizing the parameters of a generative model to maximize the accuracy of predictions, under complexity constraints. Both are mandated by the free-energy principle, which can be regarded as a probabilistic generalization of the infomax principle. We now turn to more biologically inspired ideas about brain function that focus on neuronal dynamics and plasticity. This takes us deeper into neurobiological mechanisms and the implementation of the theoretical principles outlined above.

Sufficient statistics
Quantities that are sufficient to parameterize a probability density (for example, mean and covariance of a Gaussian density).

Laplace assumption
(Or Laplace approximation or method.) A saddle-point approximation of the integral of an exponential function, that uses a second-order Taylor expansion. When the function is a probability density, the implicit assumption is that the density is approximately Gaussian.

Predictive coding
A tool used in signal processing for representing a signal using a linear predictive (generative) model. It is a powerful speech analysis technique and was first considered in vision to explain lateral interactions in the retina.

Infomax
An optimization principle for neural networks (or functions) that map inputs to outputs. It says that the mapping should maximize the Shannon mutual information between the inputs and outputs, subject to constraints and/or noise processes.

Stochastic
Governed by random effects.

Biased competition
An attentional effect mediated by competitive interactions among neurons representing visual stimuli; these interactions can be biased in favour of behaviourally relevant stimuli by both spatial and non-spatial and both bottom-up and top-down processes.

The cell assembly and correlation theory

The cell assembly theory was proposed by Hebb65 and entails Hebbian (or associative) plasticity, which is a cornerstone of use-dependent or experience-dependent plasticity66, the correlation theory of von der Malsburg67,68 and other formal refinements to Hebbian plasticity per se69. The cell assembly theory posits that groups of interconnected neurons are formed through a strengthening of synaptic connections that depends on correlated pre- and postsynaptic activity; that is, 'cells that fire together wire together'. This enables the brain to distil statistical regularities from the sensorium. The correlation theory considers the selective enabling of synaptic efficacy and its plasticity (also known as metaplasticity70) by fast synchronous activity induced by different perceptual attributes of the same object (for example, a red bus in motion). This resolves a putative deficiency of classical plasticity, which cannot ascribe a presynaptic input to a particular cause (for example, redness) in the world67. The correlation theory underpins theoretical treatments of synchronized brain activity and its role in associating or binding attributes to specific objects or causes68,71. Another important field that rests on associative plasticity is the use of attractor networks as models of memory formation and retrieval72–74. So how do correlations and associative plasticity figure in the free-energy formulation?

Hitherto, we have considered only inference on states of the world that cause sensory signals, whereby conditional expectations about states are encoded by synaptic activity. However, the causes covered by the recognition density are not restricted to time-varying states (for example, the motion of an object in the visual field): they also include time-invariant regularities that endow the world with causal structure (for example, objects fall with constant acceleration). These regularities are parameters of the generative model and have to be inferred by the brain; in other words, the conditional expectations of these parameters that may be encoded by synaptic efficacy (these are $\mu_\theta$ in BOX 2) have to be optimized. This corresponds to optimizing connection strengths in the brain, that is, plasticity that underlies learning. So what form would this learning take? It transpires that a gradient descent on free energy (that is, changing connections to reduce free energy) is formally identical to Hebbian plasticity28,42 (BOX 2). This is because the parameters of the generative model determine how expected states (synaptic activity) are mixed to form predictions. Put simply, when the presynaptic predictions and postsynaptic prediction errors are highly correlated, the connection strength increases, so that predictions can suppress prediction errors more efficiently.

In short, the formation of cell assemblies reflects the encoding of causal regularities. This is just a restatement of cell assembly theory in the context of a specific implementation (predictive coding) of the free-energy principle. It should be acknowledged that the learning rule in predictive coding is really a delta rule, which rests on Hebbian mechanisms; however, Hebb's wider notions of cell assemblies were formulated from a non-statistical perspective. Modern reformulations suggest that both inference on states (that is, perception) and inference on parameters (that is, learning) minimize free energy (that is, minimize prediction error) and serve to bound surprising exchanges with the world. So what about synchronization and the selective enabling of synapses?

Biased competition and attention

Causal regularities encoded by synaptic efficacy control the deterministic evolution of states in the world. However, stochastic (that is, random) fluctuations in these states play an important part in generating sensory data. Their amplitude is usually represented as precision (or inverse variance), which encodes the reliability of prediction errors. Precision is important, especially in hierarchical schemes, because it controls the relative influence of bottom-up prediction errors and top-down predictions. So how is precision encoded in the brain? In predictive coding, precision modulates the amplitude of prediction errors (these are $\mu_\gamma$ in BOX 2), so that prediction errors with high precision have a greater impact on units that encode conditional expectations. This means that precision corresponds to the synaptic gain of prediction error units. The most obvious candidates for controlling gain (and implicitly encoding precision) are classical neuromodulators like dopamine and acetylcholine, which provides a nice link to theories of attention and uncertainty75–77. Another candidate is fast synchronized presynaptic input that lowers effective postsynaptic membrane time constants and increases synchronous gain78. This fits comfortably with the correlation theory and speaks to recent ideas about the role of synchronous activity in mediating attentional gain79,80.

In summary, the optimization of expected precision in terms of synaptic gain links attention to synaptic gain and synchronization. This link is central to theories of attentional gain and biased competition80–85, particularly in the context of neuromodulation86,87. The theories considered so far have dealt only with perception.
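As a rough illustration of precision acting as a gain on prediction errors, the following Python sketch performs a gradient descent on a Gaussian free energy for a single hidden cause. The function name, the quadratic form of the free energy, the step size and the numerical values are assumptions made for illustration; they are not taken from BOX 2 or from any published simulation.

```python
def infer_expectation(sensation, prior_mean, sensory_precision, prior_precision,
                      step_size=0.05, n_steps=2000):
    """Gradient descent on a toy Gaussian free energy for one hidden cause.

    F(mu) = 0.5 * sensory_precision * (sensation - mu) ** 2
          + 0.5 * prior_precision * (mu - prior_mean) ** 2
    """
    mu = prior_mean  # start from the prior expectation
    for _ in range(n_steps):
        sensory_error = sensation - mu   # bottom-up prediction error
        prior_error = mu - prior_mean    # top-down prediction error
        # Each prediction error is scaled by its precision, which acts as a gain.
        mu += step_size * (sensory_precision * sensory_error
                           - prior_precision * prior_error)
    return mu


# The same sensory sample under high and low sensory precision (hypothetical values).
attended = infer_expectation(sensation=2.0, prior_mean=0.0,
                             sensory_precision=8.0, prior_precision=1.0)
unattended = infer_expectation(sensation=2.0, prior_mean=0.0,
                               sensory_precision=0.125, prior_precision=1.0)
print(round(attended, 3), round(unattended, 3))
```

With high sensory precision the conditional expectation settles close to the sensory sample (16/9, about 1.78), whereas with low sensory precision it stays close to the prior (2/9, about 0.22). In this toy setting, changing the gain on one prediction error is all that distinguishes the two cases.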
However, from the point of view of the free-energy principle, perception just makes free energy a good proxy for surprise. To actually reduce surprise we need to act. In the next section, we retain a focus on cell assemblies but move to the selection and reinforcement of stimulus–response links.

Neural Darwinism and value learning

In the theory of neuronal group selection88, the emergence of neuronal assemblies is considered in the light of selective pressure. The theory has four elements: epigenetic mechanisms create a primary repertoire of neuronal connections, which are refined by experience-dependent plasticity to produce a secondary repertoire of neuronal groups. These are selected and maintained through reentrant signalling among neuronal groups. As in cell assembly theory, plasticity rests on correlated pre- and postsynaptic activity, but here it is modulated by value. Value is signalled by ascending neuromodulatory transmitter systems and controls which neuronal groups are selected and which are not. The beauty of neural Darwinism is that it nests distinct selective processes within each other. In other words, it eschews a single unit of selection and exploits the notion of meta-selection (the selection of selective mechanisms; for example, see REF. 89). In this context, (neuronal) value confers evolutionary value (that is, adaptive fitness) by selecting neuronal groups that mediate adaptive stimulus–stimulus associations and stimulus–response links. The capacity of value to do this is assured by natural selection, in the sense that neuronal value systems are themselves subject to selective pressure.

This theory, particularly value-dependent learning90, has deep connections with reinforcement learning and related approaches in engineering (see below), such as dynamic programming and temporal difference models91,92. This is because neuronal value systems reinforce connections to themselves, thereby enabling the brain to label a sensory state as valuable if, and only if, it leads to another valuable state. This ensures that agents move through a succession of states that have acquired value to access states (rewards) with genetically specified innate value. In short, the brain maximizes value, which may be reflected in the discharge of value systems (for example, dopaminergic systems92–96). So how does this relate to the optimization of free energy?

The answer is simple: value is inversely proportional to surprise, in the sense that the probability of a phenotype being in a particular state increases with the value of that state. Furthermore, the evolutionary value of a phenotype is the negative surprise averaged over all the states it experiences, which is simply its negative entropy. Indeed, the whole point of minimizing free energy (and implicitly entropy) is to ensure that agents spend most of their time in a small number of valuable states. This means that free energy is the complement of value, and its long-term average is the complement of adaptive fitness (also known as free fitness in evolutionary biology97). But how do agents know what is valuable? In other words, how does one generation tell the next which states have value (that is, are unsurprising)?

Value or surprise is determined by the form of an agent's generative model and its implicit priors; these specify the value of sensory states and, crucially, are heritable through genetic and epigenetic mechanisms. This means that prior expectations (that is, the primary repertoire) can prescribe a small number of attractive states with innate value. In turn, this enables natural selection to optimize prior expectations and ensure they are consistent with the agent's phenotype. Put simply, valuable states are just the states that the agent expects to frequent. These expectations are constrained by the form of its generative model, which is specified genetically and fulfilled behaviourally, under active inference.

It is important to appreciate that prior expectations include not just what will be sampled from the world but also how the world is sampled. This means that natural selection may equip agents with the prior expectation that they will explore their environment until states with innate value are encountered. We will look at this more closely in the next section, where priors on motion through state space are cast in terms of policies in reinforcement learning.

Both neural Darwinism and the free-energy principle try to understand somatic changes in an individual in the context of evolution: neural Darwinism appeals to selective processes, whereas the free energy formulation considers the optimization of ensemble or population dynamics in terms of entropy and surprise. The key theme that emerges here is that (heritable) prior expectations can label things as innately valuable (unsurprising); but how can simply labelling states engender adaptive behaviour? In the next section, we return to reinforcement learning and related formulations of action that try to explain adaptive behaviour purely in terms of labels or cost functions.

Reentrant signalling
Reciprocal message passing among neuronal groups.

Reinforcement learning
An area of machine learning concerned with how an agent maximizes long-term reward. Reinforcement learning algorithms attempt to find a policy that maps states of the world to actions performed by the agent.

Optimal control theory
An optimization method (based on the calculus of variations) for deriving an optimal control law in a dynamical system. A control problem includes a cost function that is a function of state and control variables.

Bellman equation
(Or dynamic programming equation.) Named after Richard Bellman, it is a necessary condition for optimality associated with dynamic programming in optimal control theory.

Optimal control theory and game theory

Value is central to theories of brain function that are based on reinforcement learning and optimum control. The basic notion that underpins these treatments is that the brain optimizes value, which is expected reward or utility (or its complement, expected loss or cost). This is seen in behavioural psychology as reinforcement learning98, in computational neuroscience and machine learning as variants of dynamic programming such as temporal difference learning99–101, and in economics as expected utility theory102. The notion of an expected reward or cost is crucial here; this is the cost expected over future states, given a particular policy that prescribes action or choices. A policy specifies the states to which an agent will move from any given state ('motion through state space in continuous time'). This policy has to access sparse rewarding states using a cost function, which only labels states as costly or not. The problem of how the policy is optimized is formalized in optimal control theory as the Bellman equation and its variants99 (see supplementary information S4 (box)), which express value as a function of the optimal policy and a cost function. If one can solve the Bellman equation, one can associate each sensory state with a value and optimize the policy by ensuring that the next state is the most valuable of the available states.
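The relationship between a cost function, value and policy can be made concrete with a toy problem. The Python sketch below solves a discounted, discrete-state Bellman equation by value iteration and then reads off a policy that always moves to the most valuable available state. The five-state track, the discount factor and the function names are hypothetical choices made for illustration; the Review itself considers motion through state space in continuous time.

```python
def available_states(state, n_states):
    """States reachable in one move: stay put or step to a neighbour."""
    return [n for n in (state - 1, state, state + 1) if 0 <= n < n_states]


def solve_bellman(costs, discount=0.9, n_sweeps=200):
    """Value iteration for V(s) = -cost(s) + discount * max over next states of V."""
    n_states = len(costs)
    values = [0.0] * n_states
    for _ in range(n_sweeps):
        # Synchronous sweep: every state is updated from the previous values.
        values = [
            -costs[s] + discount * max(values[n] for n in available_states(s, n_states))
            for s in range(n_states)
        ]
    return values


def greedy_policy(values):
    """For each state, choose the most valuable of the available next states."""
    n_states = len(values)
    return [
        max(available_states(s, n_states), key=lambda n: values[n])
        for s in range(n_states)
    ]


# The cost function only labels states as costly (1) or not (0); state 4 is the sparse reward.
costs = [1, 1, 1, 1, 0]
values = solve_bellman(costs)
policy = greedy_policy(values)
print(values)
print(policy)
```

Here value falls off with distance from the single cost-free state, so following the greedy policy carries the agent through a succession of increasingly valuable states until the rewarding state is reached. Exhaustive sweeps of this kind are only practical because the example has five states and deterministic moves.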
In general, it is impossible to solve the Bellman equation exactly, but several approximations exist, ranging from simple Rescorla–Wagner models98 to more comprehensive formulations like Q-learning100. Cost also has a key role in Bayesian decision theory, in which optimal decisions minimize expected cost in the context of uncertainty about outcomes; this is central to optimal decision theory (game theory) and behavioural economics102–104.

So what does free energy bring to the table? If one assumes that the optimal policy performs a gradient ascent on value, then it is easy to show that value is inversely proportional to surprise (see supplementary information S4 (box)). This means that free energy is (an upper bound on) expected cost, which makes sense as optimal control theory assumes that action minimizes expected cost, whereas the free-energy principle states that it minimizes free energy. This is important because it explains why agents must minimize expected cost. Furthermore, free energy provides a quantitative and seamless connection between the cost functions of reinforcement learning and value in evolutionary biology. Finally, the dynamical perspective provides a mechanistic insight into how policies are specified in the brain: according to the principle of optimality99 cost is the rate of change of value (see supplementary information S4 (box)), which depends on changes in sensory states. This suggests that optimal policies can be prescribed by prior expectations about the motion of sensory states. Put simply, priors induce a fixed-point attractor, and when the states arrive at the fixed point, value will stop changing and cost will be minimized. A simple example is shown in FIG. 2, in which a cued arm movement is simulated using only prior expectations that the arm will be drawn to a fixed point (the target). This figure illustrates how computational motor control105–109 can be formulated in terms of priors and the suppression of sensory prediction errors (K.J.F., J. Daunizeau, J. Kilner and S.J. Kiebel, unpublished observations). More generally, it shows how rewards and goals can be considered as prior expectations that an action is obliged to fulfil16 (see also REF. 110). It also suggests how natural selection could optimize behaviour through the genetic specification of inheritable or innate priors that constrain the learning of empirical priors (BOX 2) and subsequent goal-directed action.

[Figure 2: schematic of the simulated agent. Labels: predictions; prediction errors; motor signals; action, $\dot{a} = -\partial_a \varepsilon^T \xi$; jointed arm; movement trajectory.]

Figure 2 | A demonstration of cued reaching movements. The lower right part of the figure shows a motor plant, comprising a two-jointed arm with two hidden states, each of which corresponds to a particular angular position of the two joints; the current position of the finger (red circle) is the sum of the vectors describing the location of each joint. Here, causal states in the world are the position and brightness of the target (green circle). The arm obeys Newtonian mechanics, specified in terms of angular inertia and friction. The left part of the figure illustrates that the brain senses hidden states directly in terms of proprioceptive input ($s_{prop}$) that signals the angular positions ($x_1, x_2$) of the joints and indirectly through seeing the location of the finger in space ($J_1, J_2$). In addition, through visual input ($s_{visual}$) the agent senses the target location ($v_1, v_2$) and brightness ($v_3$). Sensory prediction errors are passed to higher brain levels to optimize the conditional expectations of hidden states (that is, the angular position of the joints) and causal (that is, target) states. The ensuing predictions are sent back to suppress sensory prediction errors. At the same time, sensory prediction errors are also trying to suppress themselves by changing sensory input through action. The grey and black lines denote reciprocal message passing among neuronal populations that encode prediction error and conditional expectations; this architecture is the same as that depicted in BOX 2. The blue lines represent descending motor control signals from sensory prediction-error units. The agent's generative model included priors on the motion of hidden states that effectively engage an invisible elastic band between the finger and target (when the target is illuminated). This induces a prior expectation that the finger will be drawn to the target, when cued appropriately. The insert shows the ensuing movement trajectory caused by action. The red circles indicate the initial and final positions of the finger, which reaches the target (green circle) quickly and smoothly; the blue line is the simulated trajectory.

It should be noted that just expecting to be attracted to some states may not be sufficient to attain those states. This is because one may have to approach attractors vicariously through other states (for example, to avoid obstacles) or conform to physical constraints on action. These are some of the more difficult problems of accessing distal rewards that reinforcement learning and optimum control contend with. In these circumstances, an examination of the density dynamics, on which the free-energy principle is based, suggests that it is sufficient to keep moving until an a priori attractor is encountered (see supplementary information S5 (box)). This entails destroying unexpected (costly) fixed points in the environment by making them unstable (like shifting to a new position when sitting uncomfortably). Mathematically, this means adopting a policy that ensures a positive divergence in costly states (intuitively, this is like being pushed through a liquid with negative viscosity or friction). See FIG. 3 for a solution to the classical mountain car problem using a simple prior that induces this sort of policy. This prior is on motion through state space (that is, changes in states) and enforces exploration until an attractive state is found. Priors of this sort may provide a principled way to understand the exploration–exploitation trade-off111–113 and related issues in evolutionary biology114. The implicit use of priors to induce dynamical instability also provides a key connection to dynamical systems theory approaches to the brain that emphasize the importance of itinerant dynamics, metastability, self-organized criticality and winner-less competition115–123. These dynamical phenomena have a key role in synergetic and autopoietic accounts of adaptive behaviour5,124,125.

[Figure 3: panels. a | The mountain car problem (height $\varphi(x)$ against position $x$) and the equations of motion, $f = \begin{bmatrix} \dot{x} \\ \dot{x}' \end{bmatrix} = \begin{bmatrix} x' \\ -\nabla\varphi - \tfrac{1}{8}x' + \sigma(a) \end{bmatrix}$. b | Loss functions (priors), $c(x)$ against position $x$; conditional expectations (estimated states $\mu(t)$ and cost $c(t)$ against time in seconds); trajectories (velocity against position $x$); action (control signal $a(t)$ against time in seconds).]

Figure 3 | Solving the mountain car problem with prior expectations. a | How paradoxical but adaptive behaviour (for example, moving away from a target to ensure that it is secured later) emerges from simple priors on the motion of hidden states in the world. Shown is the landscape or potential energy function (with a minimum at position x = −0.5) that exerts forces on a mountain car. The car is shown at the target position on the hill at x = 1, indicated by the red circle. The equations of motion of the car are shown below the plot. Crucially, at x = 0 the force on the car cannot be overcome by the agent, because a squashing function −1 ≤ σ ≤ 1 is applied to action to prevent it being greater than 1. This means that the agent can access the target only by starting halfway up the left hill to gain enough momentum to carry it up the other side. b | The results of active inference under priors that destabilize fixed points outside the target domain. The priors are encoded in a cost function c(x) (top left), which acts like negative friction. When friction is negative the car expects to go faster (see Supplementary information S5 (box) for details). The inferred hidden states (upper right: position in blue, velocity in green and negative dissipation in red) show that the car explores its landscape until it encounters the target, and that friction then increases (that is, cost decreases) dramatically to prevent the car from escaping the target (by falling down the hill). The ensuing trajectory is shown in blue (bottom left). The paler lines provide exemplar trajectories from other trials, with different starting positions. In the real world, friction is constant. However, the car expects friction to change as it changes position, thus enforcing exploration or exploitation. These expectations are fulfilled by action (lower right).

In summary, optimal control and decision (game) theory start with the notion of cost or utility and try to construct value functions of states, which subsequently guide action. The free-energy formulation starts with a free-energy bound on the value of states, which is specified by priors on the motion of hidden environmental states. These priors can incorporate any cost function to ensure that costly states are avoided. States with minimum cost can be set (by learning or evolution) in terms of prior expectations about motion and the attractors that ensue. In this view, the problem of finding sparse rewards in the environment is nature's solution to the problem of how to minimize the entropy (average surprise or free energy) of an agent's states: by ensuring they occupy a small set of attracting (that is, rewarding) states.

Optimal decision theory
(Or game theory.) An area of applied mathematics concerned with identifying the values, uncertainties and other constraints that determine an optimal decision.

Gradient ascent
(Or method of steepest ascent.) A first-order optimization scheme that finds a maximum of a function by changing its arguments in proportion to the gradient of the function at the current value. In short, a hill-climbing scheme. The opposite scheme is a gradient descent.

Principle of optimality
An optimal policy has the property that whatever the initial state and initial decision, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision.

Exploration–exploitation trade-off
Involves a balance between exploration (of uncharted territory) and exploitation (of current knowledge). In reinforcement learning, it has been studied mainly through the multi-armed bandit problem.

Dynamical systems theory
An area of applied mathematics that describes the behaviour of complex (possibly chaotic) dynamical systems as described by differential or difference equations.

Synergetics
Concerns the self-organization of patterns and structures in open systems far from thermodynamic equilibrium. It rests on the order parameter concept, which was generalized by Haken to the enslaving principle: that is, the dynamics of fast-relaxing (stable) modes are completely determined by the slow dynamics of order parameters (the amplitudes of unstable modes).

Autopoietic
Referring to the fundamental dialectic between structure and function.

Helmholtzian
Refers to a device or scheme that uses a generative model to furnish a recognition density and learns hidden structures in data by optimizing the parameters of generative models.

Conclusions and future directions

Although contrived to highlight commonalities, this Review suggests that many global theories of brain function can be united under a Helmholtzian perspective of the brain as a generative model of the world it inhabits18,20,21,25 (FIG. 4); notable examples include the integration of the Bayesian brain and computational motor control theory, the objective functions shared by predictive coding and the infomax principle, hierarchical inference and theories of attention, the embedding of perception in natural selection and the link between optimum control and more exotic phenomena in dynamical systems theory. The constant theme in all these theories is that the brain optimizes a (free-energy) bound on surprise or its complement, value. This manifests as perception (so as to change predictions) or action (so as to change the sensations that are predicted). Crucially, these predictions depend on prior expectations (that furnish policies), which are optimized at different (somatic and evolutionary) timescales and define what is valuable.

[Figure 4: the free-energy principle (centre) surrounded by the related theories listed below, each with its objective.]

The free-energy principle: $a, \mu, m = \arg\min F(\tilde{s}, \mu \mid m)$. Minimization of the free energy of sensations and the representation of their causes.

Attention and biased competition: $\mu_\gamma = \arg\min \int dt\, F$. Optimization of synaptic gain representing the precision (salience) of predictions.

Computational motor control: $\dot{a} = -\partial_a \varepsilon^T \xi$. Minimization of sensory prediction errors.

Predictive coding and hierarchical inference: $\dot{\mu}_v^{(i)} = D\mu_v^{(i)} - \partial_v \varepsilon^{(i)T} \xi^{(i)} - \xi_v^{(i+1)}$. Minimization of prediction error with recurrent message passing.

Associative plasticity: $\dot{\mu}_{\theta_{ij}} = -\partial_{\theta_{ij}} \varepsilon^T \xi$. Optimization of synaptic efficacy.

Optimal control and value learning: $a, \mu = \arg\max V(\tilde{s} \mid m)$. Optimization of a free-energy bound on surprise or value.

The Bayesian brain hypothesis: $\mu = \arg\min D_{KL}(q(\vartheta) \,\|\, p(\vartheta \mid \tilde{s}))$. Minimizing the difference between a recognition density and the conditional density on sensory causes.

Perceptual learning and memory: $\mu_\theta = \arg\min \int dt\, F$. Optimization of synaptic efficacy to represent causal structure in the sensorium.

Infomax and the redundancy minimization principle: $\mu = \arg\max \{ I(\tilde{s}, \mu) - H(\mu) \}$. Maximization of the mutual information between sensations and representations.

Probabilistic neuronal coding: $q(\vartheta) = N(\mu, \Sigma)$. Encoding a recognition density in terms of conditional expectations and uncertainty.

Model selection and evolution: $m = \arg\min \int dt\, F$. Optimizing the agent's model and priors through neurodevelopment and natural selection.

Figure 4 | The free-energy principle and other theories. Some of the theoretical constructs considered in this Review and how they relate to the free-energy principle (centre). The variables are described in BOXES 1,2 and a full explanation of the equations can be found in the Supplementary information S1–S4 (boxes).

What does the free-energy principle portend for the future? If its main contribution is to integrate established theories, then the answer is probably 'not a lot'. Conversely, it may provide a framework in which current debates could be resolved, for example whether dopamine encodes reward prediction error or surprise126,127; this is particularly important for understanding conditions like addiction, Parkinson's disease and schizophrenia. Indeed, the free-energy formulation has already been used to explain the positive symptoms of schizophrenia in terms of false inference128. The free-energy formulation could also provide new approaches to old problems that might call for a reappraisal of conventional notions, particularly in reinforcement learning and motor control.

If the arguments underlying the free-energy principle hold, then the real challenge is to understand how it manifests in the brain. This speaks to a greater appreciation of hierarchical message passing41, the functional role of specific neurons and microcircuits and the dynamics they support (for example, what is the relationship between predictive coding, attention and dynamic coordination in the brain?129). Beyond neuroscience, many exciting applications in engineering, robotics, embodied cognition and evolutionary biology suggest themselves; although fanciful, it is not difficult to imagine building little free-energy machines that garner and model sensory information (like our children) to maximize the evidence for their own existence.

1. Huang, G. Is this a unified theory of the brain? New Scientist 2658, 30–33 (2008).
2. Friston K., Kilner, J. & Harrison, L. A free energy principle for the brain. J. Physiol. Paris 100, 70–87 (2006). An overview of the free-energy principle that describes its motivation and relationship to generative models and predictive coding. This paper focuses on perception and the neurobiological infrastructures involved.
3. Ashby, W. R. Principles of the self-organising dynamic system. J. Gen. Psychol. 37, 125–128 (1947).
4. Nicolis, G. & Prigogine, I. Self-Organisation in Non-Equilibrium Systems (Wiley, New York, 1977).
5. Haken, H. Synergetics: an Introduction. Non-Equilibrium Phase Transition and Self-Organisation in Physics, Chemistry and Biology 3rd edn (Springer, New York, 1983).
6. Kauffman, S. The Origins of Order: Self-Organization and Selection in Evolution (Oxford Univ. Press, Oxford, 1993).
7. Bernard, C. Lectures on the Phenomena Common to Animals and Plants (Thomas, Springfield, 1974).
8. Applebaum, D. Probability and Information: an Integrated Approach (Cambridge Univ. Press, Cambridge, UK, 2008).
9. Evans, D. J. A non-equilibrium free energy theorem for deterministic systems. Mol. Physics 101, 15551–11554 (2003).
10. Crauel, H. & Flandoli, F. Attractors for random dynamical systems. Probab. Theory Relat. Fields 100, 365–393 (1994).
11. Feynman, R. P. Statistical Mechanics: a Set of Lectures (Benjamin, Reading, Massachusetts, 1972).
12. Hinton, G. E. & van Camp, D. Keeping neural networks simple by minimising the description length of weights. Proc. 6th Annu. ACM Conf. Computational Learning Theory 5–13 (1993).
13. MacKay, D. J. C. Free-energy minimisation algorithm for decoding and cryptanalysis. Electron. Lett. 31, 445–447 (1995).
14. Neal, R. M. & Hinton, G. E. in Learning in Graphical Models (ed. Jordan, M. I.) 355–368 (Kluwer Academic, Dordrecht, 1998).
15. Itti, L. & Baldi, P. Bayesian surprise attracts human attention. Vision Res. 49, 1295–1306 (2009).
16. Friston, K., Daunizeau, J. & Kiebel, S. Active inference or reinforcement learning? PLoS ONE 4, e6421 (2009).
17. Knill, D. C. & Pouget, A. The Bayesian brain: the role of uncertainty in neural coding and computation. Trends Neurosci. 27, 712–719 (2004). A nice review of Bayesian theories of perception and sensorimotor control. Its focus is on Bayes optimality in the brain and the implicit nature of neuronal representations.
18. von Helmholtz, H. in Treatise on Physiological Optics Vol. III 3rd edn (Voss, Hamburg, 1909).
19. MacKay, D. M. in Automata Studies (eds Shannon, C. E. & McCarthy, J.) 235–251 (Princeton Univ. Press, Princeton, 1956).
20. Neisser, U. Cognitive Psychology (Appleton-Century-Crofts, New York, 1967).
36. Zemel, R., Dayan, P. & Pouget, A. Probabilistic interpretation of population code. Neural Comput. 10, 403–430 (1998).
37. Paulin, M. G. Evolution of the cerebellum as a neuronal machine for Bayesian state estimation. J. Neural Eng. 2, S219–S234 (2005).
38. Ma, W. J., Beck, J. M., Latham, P. E. & Pouget, A. Bayesian inference with probabilistic population codes. Nature Neurosci. 9, 1432–1438 (2006).
39. Friston, K., Mattout, J., Trujillo-Barreto, N., Ashburner, J. & Penny, W. Variational free energy and the Laplace approximation. Neuroimage 34, 220–234 (2007).
40. Rao, R. P. & Ballard, D. H. Predictive coding in the visual cortex: a functional interpretation of some extra-classical receptive field effects. Nature Neurosci. 2, 79–87 (1998). Applies predictive coding to cortical processing to provide a compelling account of extra-classical receptive fields in the visual system. It emphasizes the importance of top-down projections in providing predictions, by modelling perceptual inference.
41. Mumford, D. On the computational architecture of the neocortex. II. The role of cortico-cortical loops. Biol. Cybern. 66, 241–251 (1992).
42. Friston, K. Hierarchical models in the brain. PLoS Comput. Biol. 4, e1000211 (2008).
43. Murray, S. O., Kersten, D., Olshausen, B. A., Schrater, P. & Woods, D. L. Shape perception reduces activity in human primary visual cortex. Proc. Natl Acad. Sci. USA 99, 15164–15169 (2002).
44. Garrido, M. I., Kilner, J. M., Kiebel, S. J. & Friston, K. J. Dynamic causal modeling of the response to frequency deviants. J. Neurophysiol. 101, 2620–2631 (2009).
45. Sherman, S. M. & Guillery, R. W. On the actions that one nerve cell can have on another: distinguishing “drivers” from “modulators”. Proc. Natl Acad. Sci. USA 95, 7121–7126 (1998).
60. Laughlin, S. B. Efficiency and complexity in neural coding. Novartis Found. Symp. 239, 177–187 (2001).
61. Tipping, M. E. Sparse Bayesian learning and the Relevance Vector Machine. J. Machine Learn. Res. 1, 211–244 (2001).
62. Paus, T., Keshavan, M. & Giedd, J. N. Why do many psychiatric disorders emerge during adolescence? Nature Rev. Neurosci. 9, 947–957 (2008).
63. Gilestro, G. F., Tononi, G. & Cirelli, C. Widespread changes in synaptic markers as a function of sleep and wakefulness in Drosophila. Science 324, 109–112 (2009).
64. Roweis, S. & Ghahramani, Z. A unifying review of linear Gaussian models. Neural Comput. 11, 305–345 (1999).
65. Hebb, D. O. The Organization of Behaviour (Wiley, New York, 1949).
66. Paulsen, O. & Sejnowski, T. J. Natural patterns of activity and long-term synaptic plasticity. Curr. Opin. Neurobiol. 10, 172–179 (2000).
67. von der Malsburg, C. The Correlation Theory of Brain Function. Internal Report 81–2, Dept. Neurobiology, Max-Planck-Institute for Biophysical Chemistry (1981).
68. Singer, W. & Gray, C. M. Visual feature integration and the temporal correlation hypothesis. Annu. Rev. Neurosci. 18, 555–586 (1995).
69. Bienenstock, E. L., Cooper, L. N. & Munro, P. W. Theory for the development of neuron selectivity: orientation specificity and binocular interaction in visual cortex. J. Neurosci. 2, 32–48 (1982).
70. Abraham, W. C. & Bear, M. F. Metaplasticity: the plasticity of synaptic plasticity. Trends Neurosci. 19, 126–130 (1996).
71. Pareti, G. & De Palma, A. Does the brain oscillate? The dispute on neuronal synchronization. Neurol. Sci. 25, 41–47 (2004).
21. Gregory, R. L. Perceptual illusions and brain models.
46. Angelucci, A. & Bressloff, P. C. Contribution of
72. Leutgeb, S., Leutgeb, J. K., Moser, M. B. & Moser, E. I. Place cells, spatial maps and the population code for memory. Curr. Opin. Neurobiol. 15, 738–746
Proc. R. Soc. Lond. B Biol. Sci. 171, 179196 (1968). feedforward, lateral and feedback connections to the (2005). 22. Gregory, R. L. Perceptions as hypotheses. Philos. classical receptive field center and extra-classical 73. Durstewitz, D. & Seamans, J. K. Beyond bistability: Trans. R. Soc. Lond. B Biol. Sci. 290, 181197 (1980). receptive field surround of primate V1 neurons. biophysics and temporal dynamics of working memory. 23. Ballard, D. H., Hinton, G. E. & Sejnowski, T. J. Parallel Prog. Brain Res. 154, 93120 (2006). Neuroscience 139, 119133 (2006). visual computation. Nature 306, 2126 (1983). 47. Grossberg, S. Towards a unified theory of neocortex: 74. Anishchenko, A. & Treves, A. Autoassociative memory 24. Kawato, M., Hayakawa, H. & Inui, T. A forward-inverse laminar cortical circuits for vision and cognition. retrieval and spontaneous activity bumps in small- optics model of reciprocal connections between visual Prog. Brain Res. 165, 79104 (2007). world networks of integrate-and-fire neurons. areas. Network: Computation in Neural Systems 4, 48. Grossberg, S. & Versace, M. Spikes, synchrony, and J. Physiol. Paris 100, 225236 (2006). 415422 (1993). attentive learning by laminar thalamocortical circuits. 75. Abbott, L. F., Varela, J. A., Sen, K. & Nelson, S. B. 25. Dayan, P., Hinton, G. E. & Neal, R. M. The Helmholtz Brain Res. 1218, 278312 (2008). Synaptic depression and cortical gain control. Science machine. Neural Comput. 7, 889904 (1995). 49. Barlow, H. in Sensory Communication (ed. Rosenblith, W.) 275, 220224 (1997). This paper introduces the central role of generative 217234 (MIT Press, Cambridge, Massachusetts, 76. Yu, A. J. & Dayan, P. Uncertainty, neuromodulation models and variational approaches to hierarchical 1961). and attention. Neuron 46, 681692 (2005). self-supervised learning and relates this to the 50. Linsker, R. Perceptual neural organisation: some 77. Doya, K. Metalearning and neuromodulation. 
Neural function of bottom-up and top-down cortical approaches based on network models and Netw. 15, 495506 (2002). processing pathways. information theory. Annu. Rev. Neurosci. 13, 78. Chawla, D., Lumer, E. D. & Friston, K. J. The 26. Lee, T. S. & Mumford, D. Hierarchical Bayesian 257281 (1990). relationship between synchronization among neuronal inference in the visual cortex. J. Opt. Soc. Am. A Opt. 51. Oja, E. Neural networks, principal components, and populations and their mean activity levels. Neural Image Sci. Vis. 20, 14341448 (2003). subspaces. Int. J. Neural Syst. 1, 6168 (1989). Comput. 11, 13891411 (1999). 27. Kersten, D., Mamassian, P. & Yuille, A. Object 52. Bell, A. J. & Sejnowski, T. J. An information 79. Fries, P., Womelsdorf, T., Oostenveld, R. & Desimone, R. perception as Bayesian inference. Annu. Rev. Psychol. maximisation approach to blind separation and blind The effects of visual stimulation and selective visual 55, 271304 (2004). de-convolution. Neural Comput. 7, 11291159 attention on rhythmic neuronal synchronization in 28. Friston, K. J. A theory of cortical responses. Philos. (1995). macaque area V4. J. Neurosci. 28, 48234835 Trans. R. Soc. Lond. B Biol. Sci. 360, 815836 53. Atick, J. J. & Redlich, A. N. What does the retina know (2008). (2005). about natural scenes? Neural Comput. 4, 196210 80. Womelsdorf, T. & Fries, P. Neuronal coherence during 29. Beal, M. J. Variational Algorithms for Approximate (1992). selective attentional processing and sensory-motor Bayesian Inference. Thesis, University College London 54. Optican, L. & Richmond, B. J. Temporal encoding of integration. J. Physiol. Paris 100, 182193 (2006). (2003). two-dimensional patterns by single units in primate 81. Desimone, R. Neural mechanisms for visual memory 30. Efron, B. & Morris, C. Steins estimation rule and its inferior cortex. III Information theoretic analysis. and their role in attention. Proc. Natl Acad. Sci. USA competitors an empirical Bayes approach. J. Am. J. 
Neurophysiol. 57, 132146 (1987). 93, 1349413499 (1996). Stats. Assoc. 68, 117130 (1973). 55. Olshausen, B. A. & Field, D. J. Emergence of simple- A nice review of mnemonic effects (such as 31. Kass, R. E. & Steffey, D. Approximate Bayesian cell receptive field properties by learning a sparse repetition suppression) on neuronal responses and inference in conditionally independent hierarchical code for natural images. Nature 381, 607609 how they bias the competitive interactions between models (parametric empirical Bayes models). J. Am. (1996). stimulus representations in the cortex. It provides Stat. Assoc. 407, 717726 (1989). 56. Simoncelli, E. P. & Olshausen, B. A. Natural image a good perspective on attentional mechanisms in 32. Zeki, S. & Shipp, S. The functional logic of cortical statistics and neural representation. Annu. Rev. the visual system that is empirically grounded. connections. Nature 335, 311317 (1988). Neurosci. 24, 11931216 (2001). 82. Treisman, A. Feature binding, attention and object Describes the functional architecture of cortical A nice review of information theory in visual perception. Philos. Trans. R. Soc. Lond. B Biol. Sci. hierarchies with a focus on patterns of anatomical processing. It covers natural scene statistics and 353, 12951306 (1998). connections in the visual cortex. It emphasizes the empirical tests of the efficient coding hypothesis in 83. Maunsell, J. H. & Treue, S. Feature-based attention in role of functional segregation and integration (that individual neurons and populations of neurons. visual cortex. Trends Neurosci. 29, 317322 (2006). is, message passing among cortical areas). 57. Friston, K. J. The labile brain. III. Transients and 84. Spratling, M. W. Predictive-coding as a model of 33. Felleman, D. J. & Van Essen, D. C. Distributed spatio-temporal receptive fields. Philos. Trans. R. Soc. biased competition in visual attention. Vision Res. 48, hierarchical processing in the primate cerebral cortex. Lond. B Biol. Sci. 
355, 253265 (2000). 13911408 (2008). Cereb. Cortex 1, 147 (1991). 58. Bialek, W., Nemenman, I. & Tishby, N. Predictability, 85. Reynolds, J. H. & Heeger, D. J. The normalization 34. Mesulam, M. M. From sensation to cognition. Brain complexity, and learning. Neural Comput. 13, model of attention. Neuron 61, 168185 (2009). 121, 10131052 (1998). 24092463 (2001). 86. Schroeder, C. E., Mehta, A. D. & Foxe, J. J. 35. Sanger, T. Probability density estimation for the 59. Lewen, G. D., Bialek, W. & de Ruyter van Steveninck, Determinants and mechanisms of attentional interpretation of neural population codes. R. R. Neural coding of naturalistic motion stimuli. modulation of neural processing. Front. Biosci. 6, J. Neurophysiol. 76, 27902793 (1996). Network 12, 317329 (2001). D672D684 (2001). NATuRE REvIEWs | NeuroscieNce voluME 11 | FEBRuARy 2010 | 137 © 2010 Macmillan Publishers Limited. All rights reserved REVIEWS 87. Hirayama, J., Yoshimoto, J. & Ishii, S. Bayesian 106. Todorov, E. & Jordan, M. I. Smoothness maximization 119. Bressler, S. L. & Tognoli, E. Operational principles of representation learning in the cortex regulated by along a predefined path accurately predicts the speed neurocognitive networks. Int. J. Psychophysiol. 60, acetylcholine. Neural Netw. 17, 13911400 (2004). profiles of complex arm movements. J. Neurophysiol. 139148 (2006). 88. Edelman, G. M. Neural Darwinism: selection and 80, 696714 (1998). 120. Werner, G. Brain dynamics across levels of reentrant signaling in higher brain function. Neuron 107. Tseng, Y. W., Diedrichsen, J., Krakauer, J. W., organization. J. Physiol. Paris 101, 273279 (2007). 10, 115125 (1993). Shadmehr, R. & Bastian, A. J. Sensory prediction- 121. Pasquale, V., Massobrio, P., Bologna, L. L., 89. Knobloch, F. Altruism and the hypothesis of meta- errors drive cerebellum-dependent adaptation of Chiappalone, M. & Martinoia, S. Self-organization and selection in human evolution. J. Am. Acad. reaching. J. Neurophysiol. 
98, 5462 (2007). neuronal avalanches in networks of dissociated cortical Psychoanal. 29, 339354 (2001). 108. Bays, P. M. & Wolpert, D. M. Computational neurons. Neuroscience 153, 13541369 (2008). 90. Friston, K. J., Tononi, G., Reeke, G. N. Jr, Sporns, O. & principles of sensorimotor control that minimize 122. Kitzbichler, M. G., Smith, M. L., Christensen, S. R. & Edelman, G. M. Value-dependent selection in the uncertainty and variability. J. Physiol. 578, 387396 Bullmore, E. Broadband criticality of human brain brain: simulation in a synthetic neural model. (2007). network synchronization. PLoS Comput. Biol. 5, Neuroscience 59, 229243 (1994). A nice overview of computational principles in e1000314 (2009). 91. Sutton, R. S. & Barto, A. G. Toward a modern theory of motor control. Its focus is on representing 123. Rabinovich, M., Huerta, R. & Laurent, G. Transient adaptive networks: expectation and prediction. uncertainty and optimal estimation when dynamics for neural processing. Science 321 4850 Psychol. Rev. 88, 135170 (1981). extracting the sensory information required for (2008). 92. Montague, P. R., Dayan, P., Person, C. & Sejnowski, motor planning. 124. Tschacher, W. & Hake, H. Intentionality in non- T. J. Bee foraging in uncertain environments using 109. Shadmehr, R. & Krakauer, J. W. A computational equilibrium systems? The functional aspects of self- predictive Hebbian learning. Nature 377, 725728 neuroanatomy for motor control. Exp. Brain Res. 185, organised pattern formation. New Ideas Psychol. 25, (1995). 359381 (2008). 115 (2007). A computational treatment of behaviour that 110. Verschure, P. F., Voegtlin, T. & Douglas, R. J. 125. Maturana, H. R. & Varela, F. De máquinas y seres combines ideas from optimal control theory and Environmentally mediated synergy between vivos (Editorial Universitaria, Santiago, 1972).
dynamic programming with the neurobiology of perception and behaviour in mobile robots. Nature English translation available in Maturana, H. R. & reward. This provided an early example of value 425, 620624 (2003). Varela, F. in Autopoiesis and Cognition (Reidel, learning in the brain. 111. Cohen, J. D., McClure, S. M. & Yu, A. J. Should I stay Dordrecht, 1980). 93. Schultz, W. Predictive reward signal of dopamine or should I go? How the human brain manages the 126. Fiorillo, C. D., Tobler, P. N. & Schultz, W. Discrete neurons. J. Neurophysiol. 80, 127 (1998). trade-off between exploitation and exploration. Philos. coding of reward probability and uncertainty by 94. Daw, N. D. & Doya, K. The computational Trans. R. Soc. Lond. B Biol. Sci. 362, 933942 dopamine neurons. Science 299, 18981902 neurobiology of learning and reward. Curr. Opin. (2007). (2003). Neurobiol. 16, 199204 (2006). 112. Ishii, S., Yoshida, W. & Yoshimoto, J. Control of 127. Niv, Y., Duff, M. O. & Dayan, P. Dopamine,
95. Redgrave, P. & Gurney, K. The short-latency dopamine exploitation-exploration meta-parameter in uncertainty and TD learning. Behav. Brain Funct. 1, 6 signal: a role in discovering novel actions? Nature Rev. reinforcement learning. Neural Netw. 15, 665687 (2005). Neurosci. 7, 967975 (2006). (2002). 128. Fletcher, P. C. & Frith, C. D. Perceiving is believing: a 96. Berridge, K. C. The debate over dopamines role in 113. Usher, M., Cohen, J. D., Servan-Schreiber, D., Bayesian approach to explaining the positive reward: the case for incentive salience. Rajkowski, J. & Aston-Jones, G. The role of locus symptoms of schizophrenia. Nature Rev. Neurosci. 10, Psychopharmacology (Berl.) 191, 391431 (2007). coeruleus in the regulation of cognitive performance. 4858 (2009). 97. Sella, G. & Hirsh, A. E. The application of statistical Science 283, 549554 (1999). 129. Phillips, W. A. & Silverstein, S. M. Convergence of physics to evolutionary biology. Proc. Natl Acad. Sci. 114. Voigt, C. A., Kauffman, S. & Wang, Z. G. Rational biological and psychological perspectives on cognitive USA 102, 95419546 (2005). evolutionary design: the theory of in vitro protein coordination in schizophrenia. Behav. Brain Sci. 26, 98. Rescorla, R. A. & Wagner, A. R. in Classical evolution. Adv. Protein Chem. 55, 79160 (2000). 6582 (2003). Conditioning II: Current Research and Theory (eds 115. Freeman, W. J. Characterization of state transitions in 130. Friston, K. & Kiebel, S. Cortical circuits for perceptual Black, A. H. & Prokasy, W. F.) 6499 (Appleton spatially distributed, chaotic, nonlinear, dynamical inference. Neural Netw. 22, 10931104 (2009). Century Crofts, New York, 1972). systems in cerebral cortex. Integr. Physiol. Behav. Sci. 99. Bellman, R. On the Theory of Dynamic Programming. 29, 294306 (1994). Acknowledgments Proc. Natl Acad. Sci. USA 38, 716719 (1952). 116. Tsuda, I. Toward an interpretation of dynamic neural This work was funded by the Wellcome Trust. I would like to 100. 
Watkins, C. J. C. H. & Dayan, P. Q-learning. Mach. activity in terms of chaotic dynamical systems. Behav. thank my colleagues at the Wellcome Trust Centre for Learn. 8, 279292 (1992). Brain Sci. 24, 793810 (2001). Neuroimaging, the Institute of Cognitive Neuroscience and the 101. Todorov, E. in Advances in Neural Information 117. Jirsa, V. K., Friedrich, R., Haken, H. & Kelso, J. A. Gatsby Computational Neuroscience Unit for collaborations Processing Systems (eds Scholkopf, B., Platt, J. & A theoretical model of phase transitions in the human and discussions. Hofmann T.) 19, 13691376 (MIT Press, 2006). brain. Biol. Cybern. 71, 2735 (1994). 102. Camerer, C. F. Behavioural studies of strategic thinking This paper develops a theoretical model (based on Competing interests statement in games. Trends Cogn. Sci. 7, 225231 (2003). synergetics and nonlinear oscillator theory) that The author declares no competing financial interests. 103. Smith, J. M. & Price, G. R. The logic of animal conflict. reproduces observed dynamics and suggests a Nature 246, 1518 (1973). formulation of biophysical coupling among brain SUPPLEMENTARY INFORMATION 104. Nash, J. Equilibrium points in n-person games. systems. See online article: S1 (box) | S2 (box) | S3 (box) | S4 (box) |
Proc. Natl Acad. Sci. USA 36, 4849 (1950). 118. Breakspear, M. & Stam, C. J. Dynamics of a S5 (box) 105. Wolpert, D. M. & Miall, R. C. Forward models for neural system with a multiscale architecture. Philos. physiological motor control. Neural Netw. 9, Trans. R. Soc. Lond. B Biol. Sci. 360, 10511074 All liNks Are AcTive iN The oNliNe pdf 12651279 (1996). (2005). 138 | FEBRuARy 2010 | voluME 11 www.nature.com/reviews/neuro © 2010 Macmillan Publishers Limited. All rights reserved

       SUPPLEMENTARY INFORMATION                                                In format provided by Friston (FEBRUARY 2010) 
                Supplementary information S1 (box): The entropy of sensory states and their causes 
This box shows that the entropy of hidden states in the environment is bounded by the entropy of sensory states. This means that if the entropy of sensory signals is minimised, so is the entropy of the environmental states that caused them. For any agent or model $m$, the entropy of generalised sensory states $\tilde{s}(t) = [s, s', s'', \ldots]^{T}$ is simply their average surprise $-\ln p(\tilde{s}|m)$ (with a slight abuse of notation)

$$H(\tilde{s}|m) := \int -p(\tilde{s}|m)\ln p(\tilde{s}|m)\,d\tilde{s} = \lim_{T\to\infty}\int_{0}^{T} -\ln p(\tilde{s}(t)|m)\,dt \tag{S1.1}$$
                 
                Under ergodic assumptions, this is just the long-term time or path-integral of surprise. We will 
                assume sensory states are an analytic function of hidden environmental states plus some 
                generalised random fluctuations 
                 
$$\begin{aligned} \tilde{s} &= g(\tilde{x},\theta) + \tilde{z} \\ \dot{\tilde{x}} &= f(\tilde{x},\theta) + \tilde{w} \end{aligned} \tag{S1.2}$$
                 
Here, hidden states change according to the stochastic differential equations of motion (with parameters $\theta$) in S1.2. Because $\tilde{x}$ and $\tilde{z}$ are statistically independent, we have (see Eq. 6.4.6 in Jones 1979, p149)

$$I(\tilde{s},\tilde{z}) = H(\tilde{s}|m) - H(\tilde{x}|m) - \int p(\tilde{x}|m)\ln|\partial_{\tilde{x}} g|\,d\tilde{x} \tag{S1.3}$$
                 
Here, $I(\tilde{s},\tilde{z}) \geq 0$ is the mutual information between the sensory states and noise. By Gibbs inequality this cross-entropy or Kullback-Leibler divergence is non-negative (Theorem 6.5; Jones 1979, p151). This means the entropy of the sensory states is greater than the entropy of the sensory mapping. Here, $\partial_{\tilde{x}} g$ is the sensitivity or gradient of the sensory mapping with respect to the hidden states. The integral in S1.3 reflects the fact that entropy is not invariant to a change of variables and assumes that the sensory mapping $g: \tilde{x} \to \tilde{s}$ is diffeomorphic (i.e., bijective and smooth). This requires the hidden and sensory state-spaces to have the same dimension, which can be assured by truncating generalised states at an appropriately high order. For example, if we had $n$ hidden states in $m$ generalised coordinates of motion, we would consider $m$ sensory states in $n$ generalised coordinates; so that $\dim(\tilde{x}) = \dim(\tilde{s}) = n \times m$. Finally, rearranging S1.3 gives
           
$$H(\tilde{x}|m) \leq H(\tilde{s}|m) - \int p(\tilde{x}|m)\ln|\partial_{\tilde{x}} g|\,d\tilde{x} \tag{S1.4}$$
           
          In conclusion, the entropy of hidden states is upper-bounded by the entropy of sensations, 
          assuming their sensitivity to hidden states is constant, over the range of states encountered. 
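
As a sanity check on S1.3 and S1.4, the following sketch evaluates each term in closed form for a scalar linear-Gaussian sensory mapping. The model $s = g\,x + z$, the gain and the two variances are illustrative assumptions that do not appear in the original box; they are chosen because the sensitivity $\ln|\partial_x g|$ is then constant, as the conclusion above requires.

```python
import math


def gaussian_entropy(variance: float) -> float:
    """Differential entropy (in nats) of a 1-D Gaussian with the given variance."""
    return 0.5 * math.log(2.0 * math.pi * math.e * variance)


def entropy_bound_terms(gain: float, var_x: float, var_z: float) -> dict:
    """Terms of S1.3 and S1.4 for the hypothetical scalar model s = gain * x + z."""
    h_x = gaussian_entropy(var_x)                      # H(x|m)
    h_s = gaussian_entropy(gain ** 2 * var_x + var_z)  # H(s|m), since x and z are independent
    sensitivity = math.log(abs(gain))                  # integral of p(x) ln|dg/dx| dx (constant gain)
    mutual_info = h_s - h_x - sensitivity              # I(s, z) as written in S1.3
    return {"H_x": h_x, "H_s": h_s, "sensitivity": sensitivity, "I_sz": mutual_info}


# Illustrative values only; none of these numbers come from the paper.
terms = entropy_bound_terms(gain=2.0, var_x=1.5, var_z=0.5)
bound = terms["H_s"] - terms["sensitivity"]  # right-hand side of S1.4
print(terms)
print("H(x|m) <= bound:", terms["H_x"] <= bound)
```

In this toy case the slack in the bound is exactly the mutual information between sensations and noise, so the bound becomes tight as the noise variance goes to zero.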
           
          Clearly,  the  ergodic  assumption  in  S1.1  only  holds  over  certain  temporal  scales  for  real 
          organisms that are on a trajectory from birth to death. This scale can be somatic (e.g., over 
          days  or  months,  where  development  is  locally  stationary)  or  evolutionary  (e.g.,  over 
          generations, where evolution is locally stationary). 
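
The ergodic reading of S1.1 can be illustrated numerically. The sketch below simulates a stationary first-order autoregressive process (an assumption made purely for illustration, as are the parameter values) and compares the time-averaged surprise with the analytic entropy of its stationary Gaussian density. Note that the path-integral is divided by the number of samples here so that it converges to an average; that normalisation is our own choice and is left implicit in S1.1.

```python
import math
import random


def time_averaged_surprise(n_steps: int, rho: float = 0.9, sigma: float = 1.0, seed: int = 0) -> float:
    """Average of -ln p(s_t) along one trajectory of a stationary AR(1) process.

    The process s_{t+1} = rho * s_t + noise has stationary density N(0, sigma^2).
    """
    rng = random.Random(seed)
    s = rng.gauss(0.0, sigma)                          # start in the stationary density
    innovation_sd = sigma * math.sqrt(1.0 - rho ** 2)  # keeps the variance at sigma^2
    log_norm = 0.5 * math.log(2.0 * math.pi * sigma ** 2)
    total = 0.0
    for _ in range(n_steps):
        total += log_norm + s ** 2 / (2.0 * sigma ** 2)  # surprise -ln p(s|m)
        s = rho * s + rng.gauss(0.0, innovation_sd)
    return total / n_steps


entropy = 0.5 * math.log(2.0 * math.pi * math.e * 1.0)  # entropy of N(0, 1) in nats
estimate = time_averaged_surprise(200_000)
print("analytic entropy:", entropy)
print("time-averaged surprise:", estimate)
```

Because successive samples are correlated, the average converges more slowly than it would for independent draws, which is one concrete sense in which the ergodic assumption only holds over sufficiently long temporal scales.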
           
           
          Reference 
Jones, D. S. (1979). Elementary Information Theory. Oxford: Clarendon Press; New York: Oxford University Press.
           
               Supplementary information S2 (box): Variational free energy 
Here, we derive the free-energy and show how its various formulations relate to each other. We start with the quantity we want to bound; namely, the surprise or negative log-evidence associated with sensory states $\tilde{s}(t)$ that have been caused by some unknown quantities $\vartheta \supset \{\tilde{x},\theta\}$, which include the hidden states and parameters in box (S1)

$$-\ln p(\tilde{s}(t)) = -\ln \int p(\tilde{s}(t),\vartheta)\,d\vartheta \tag{S2.1}$$
                
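To create a free-energy bound on surprise $F(\tilde{s}(t),q(\vartheta))$, we simply add a non-negative cross-entropy between an arbitrary (recognition) density on the causes $q(\vartheta)$ and their posterior density $p(\vartheta|\tilde{s})$ (dropping the dependency on $m$ for clarity).

$$\begin{aligned} F &= \int q(\vartheta)\ln\frac{q(\vartheta)}{p(\vartheta|\tilde{s})}\,d\vartheta - \ln p(\tilde{s}) \\ &= D(q(\vartheta)\,\|\,p(\vartheta|\tilde{s})) - \ln p(\tilde{s}) \end{aligned} \tag{S2.2}$$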
                
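The cross-entropy term is non-negative by Gibbs inequality. In short, free-energy is cross-entropy plus surprise. Because surprise depends only on sensory states, we can bring it inside the integral and use $p(\vartheta,\tilde{s}) = p(\vartheta|\tilde{s})p(\tilde{s})$ to show free-energy is expected energy minus entropy

$$\begin{aligned} F &= \int q(\vartheta)\ln\frac{q(\vartheta)}{p(\vartheta|\tilde{s})p(\tilde{s})}\,d\vartheta \\ &= \int q(\vartheta)\ln q(\vartheta)\,d\vartheta - \int q(\vartheta)\ln p(\vartheta,\tilde{s})\,d\vartheta \\ &= -\langle \ln p(\vartheta,\tilde{s}) \rangle_{q} + \langle \ln q(\vartheta) \rangle_{q} \end{aligned} \tag{S2.3}$$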
                
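where $-\ln p(\vartheta,\tilde{s})$ is Gibbs energy. A final rearrangement, using $p(\vartheta,\tilde{s}) = p(\tilde{s}|\vartheta)p(\vartheta)$, shows free-energy is also complexity minus accuracy, where complexity is the cross-entropy between the recognition $q(\vartheta)$ and prior density $p(\vartheta)$

$$\begin{aligned} F &= \int q(\vartheta)\ln\frac{q(\vartheta)}{p(\tilde{s}|\vartheta)p(\vartheta)}\,d\vartheta \\ &= \int q(\vartheta)\ln\frac{q(\vartheta)}{p(\vartheta)}\,d\vartheta - \int q(\vartheta)\ln p(\tilde{s}|\vartheta)\,d\vartheta \\ &= D(q(\vartheta)\,\|\,p(\vartheta)) - \langle \ln p(\tilde{s}|\vartheta) \rangle_{q} \end{aligned} \tag{S2.4}$$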
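
The three formulations of free-energy (S2.2 to S2.4) can be checked against each other on a small discrete problem. The sketch below is only an illustration: the three-state cause, the prior, the likelihood of a single observed sensation and the recognition density are all made-up numbers, and integrals over $\vartheta$ become sums.

```python
import math

# Hypothetical three-state example; none of these numbers come from the paper.
prior = [0.5, 0.3, 0.2]          # p(theta)
likelihood = [0.10, 0.60, 0.30]  # p(s | theta) for the one observed s
q = [0.2, 0.5, 0.3]              # arbitrary recognition density q(theta), all entries > 0

joint = [l * p for l, p in zip(likelihood, prior)]  # p(theta, s)
evidence = sum(joint)                               # p(s)
posterior = [j / evidence for j in joint]           # p(theta | s)
surprise = -math.log(evidence)                      # -ln p(s)


def kl(a, b):
    """Kullback-Leibler divergence D(a || b) between two discrete densities."""
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)


def expect(weights, values):
    """Expectation of `values` under the discrete density `weights`."""
    return sum(w * v for w, v in zip(weights, values))


def free_energy_forms(density):
    """Free-energy of a strictly positive density, computed as in S2.2, S2.3 and S2.4."""
    divergence_plus_surprise = kl(density, posterior) + surprise
    energy_minus_entropy = (expect(density, [-math.log(j) for j in joint])
                            + expect(density, [math.log(d) for d in density]))
    complexity_minus_accuracy = (kl(density, prior)
                                 - expect(density, [math.log(l) for l in likelihood]))
    return divergence_plus_surprise, energy_minus_entropy, complexity_minus_accuracy


print("surprise:", surprise)
print("free-energy (three forms):", free_energy_forms(q))
print("free-energy at the posterior:", free_energy_forms(posterior)[0])
```

All three forms agree for any strictly positive recognition density, the value never falls below surprise, and the bound is attained when the recognition density equals the posterior.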
                 
       
            Supplementary information S3 (box): The free-energy principle and infomax 
Here, we show that the free-energy principle is a probabilistic generalisation of the infomax principle. The infomax principle requires the mutual information $I(\tilde{s},\mu)$ between sensory data and their conditional representation $\mu(t)$ to be maximal, under prior constraints on the representations; e.g., $p(\mu) = \mathcal{N}(0,I)$. This can be stated as an optimisation of an infomax criterion

$$\begin{aligned} \mu^{*} &= \arg\max_{\mu} G \\ G &= I(\tilde{s},\mu) - H(\mu) \\ &= H(\tilde{s}) - H(\tilde{s}|\mu) - H(\mu) \end{aligned} \tag{S3.1}$$
             
Because the representations do not change sensory data, they are only required to minimise the average surprise about them, given the representations; and the average surprise about the representations, given their prior constraints. These are the last two terms in (S3.1). If the recognition density is a point mass at $\mu(t)$; i.e., $q(\vartheta) = \delta(\vartheta - \mu)$, the free-energy from (S2.4) reduces to

$$F = -\ln p(\tilde{s}|\mu) - \ln p(\mu) \tag{S3.2}$$
             
From (S1.1), the path-integral of free-energy (also known as free-action) becomes

$$\mathcal{A} = \int dt\,F(\tilde{s}(t),\mu(t)) \propto H(\tilde{s}|\mu) + H(\mu) \tag{S3.3}$$
             
This means optimising the conditional expectations with respect to free-energy and (by the fundamental lemma of variational calculus) free-action is exactly the same as optimising the infomax criterion

$$\mu^{*} = \arg\min_{\mu} F = \arg\min_{\mu} \mathcal{A} = \arg\max_{\mu} G \tag{S3.4}$$
            
In short, the infomax principle is a special case of the free-energy principle that obtains when we discount uncertainty and represent sensory data with point estimates of their causes. Alternatively, the free-energy is a generalisation of the infomax principle that covers probability densities on the unknown causes of data. In this context, high mutual information is assured by maximising accuracy (e.g., minimising prediction error) and the prior constraints are enforced by minimising complexity (see S2.4).
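
To make the point-estimate case concrete, the sketch below minimises the free-energy of S3.2 by gradient descent for a scalar Gaussian model. The generative model (likelihood $\mathcal{N}(w\mu, \sigma^{2})$ with prior $\mathcal{N}(0,1)$), the constants `W` and `VAR`, the learning rate and the observed value are all assumptions introduced for illustration. The descent is driven by the precision-weighted prediction error (accuracy) and a shrinkage term from the prior (complexity).

```python
import math

W = 2.0    # hypothetical gain of the sensory mapping
VAR = 0.5  # hypothetical variance of sensory noise


def free_energy(mu: float, s: float) -> float:
    """F = -ln p(s|mu) - ln p(mu) for the assumed Gaussian likelihood and N(0, 1) prior."""
    neg_log_likelihood = 0.5 * math.log(2.0 * math.pi * VAR) + (s - W * mu) ** 2 / (2.0 * VAR)
    neg_log_prior = 0.5 * math.log(2.0 * math.pi) + mu ** 2 / 2.0
    return neg_log_likelihood + neg_log_prior


def minimise_free_energy(s: float, rate: float = 0.1, steps: int = 200) -> float:
    """Plain gradient descent on F with respect to the point estimate mu."""
    mu = 0.0
    for _ in range(steps):
        prediction_error = s - W * mu
        gradient = -W * prediction_error / VAR + mu  # dF/dmu
        mu -= rate * gradient
    return mu


s_obs = 1.2                           # illustrative sensory sample
mu_hat = minimise_free_energy(s_obs)
mu_exact = W * s_obs / (W ** 2 + VAR)  # analytic minimiser of F for this model
print("gradient descent:", mu_hat)
print("analytic optimum:", mu_exact)
```

For this model the minimiser of free-energy coincides with the posterior mode, which is what one would expect when the recognition density collapses to a point mass.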
            Supplementary information S4 (box): Value and surprise 
Here, we compare and contrast optimal control and free-energy formulations of dynamics on hidden or sensory states. To keep things simple, we will assume the hidden states are known (as is usually assumed in control theory) and ignore random fluctuations; i.e., $\tilde{w}(t) = 0$ (see box S1). In optimal control, one starts with a loss or cost-function (negative reward or utility), $c(\tilde{x})$, and optimises the motion of states to maximise value or expected reward over time

$$\begin{aligned} a &= \arg\max_{a} f(\tilde{x},a)\cdot\nabla V(\tilde{x}) \\ V(\tilde{x}(0)) &= \int_{0}^{\infty} -c(\tilde{x}(t))\,dt \;\Rightarrow\; \dot{V}(\tilde{x}(t)) = c(\tilde{x}) \end{aligned} \tag{S4.1}$$
             
The first equality says that motion ascends the gradients of the value-function and the second just defines value as reward that will be accumulated in the future. Note the equations of motion $\dot{\tilde{x}} = f(\tilde{x},a)$ now include action. The value-function is the solution to the celebrated Hamilton-Jacobi-Bellman equation

$$\begin{aligned} \max_{a}\{\dot{V}(\tilde{x}(t)) - c(\tilde{x})\} &= 0 \;\Rightarrow \\ \max_{a}\{f(\tilde{x},a)\cdot\nabla V(\tilde{x}) - c(\tilde{x})\} &= 0 \end{aligned} \tag{S4.2}$$
             
            This solution ensures that the rate of change of value is cost, as required by the definition of 
            value. In summary, (S4.1) says that action maximises value and (S4.2) means that value is 
            the reward expected under this policy. This ensures low-cost regions attract all trajectories 
            through state-space. 
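
The relationships in S4.1 and S4.2 can be checked on a toy problem. The following is a minimal sketch, not part of the original box: the one-dimensional flow f(x, a) = a·x, the three-element action set, the cost c(x) = x² and the candidate value-function V(x) = −x²/2 are all assumptions chosen so that the answer is known in closed form. The function names are hypothetical.

```python
ACTIONS = (-1.0, 0.0, 1.0)  # hypothetical discrete action set


def flow(x, a):
    """Assumed equations of motion dx/dt = f(x, a) = a * x."""
    return a * x


def cost(x):
    """Assumed cost-function c(x) = x^2 (zero at the origin)."""
    return x * x


def grad_value(x):
    """Gradient of the candidate value-function V(x) = -x^2 / 2."""
    return -x


def greedy_action(x):
    """First line of S4.1: the action whose flow ascends the value gradient fastest."""
    return max(ACTIONS, key=lambda a: flow(x, a) * grad_value(x))


def hjb_residual(x):
    """Left-hand side of S4.2: max over a of f(x, a) * dV/dx - c(x)."""
    return max(flow(x, a) * grad_value(x) - cost(x) for a in ACTIONS)


def value_by_rollout(x0, horizon=20.0, dt=1e-3):
    """Second line of S4.1: accumulate -c(x(t)) dt along the greedy trajectory.

    Uses forward Euler steps and truncates the infinite integral at `horizon`.
    """
    x, value = x0, 0.0
    for _ in range(int(horizon / dt)):
        value -= cost(x) * dt
        x += flow(x, greedy_action(x)) * dt
    return value
```

Under these assumptions the greedy action is a = −1 for any non-zero state, the Hamilton-Jacobi-Bellman residual vanishes, and the rolled-out value agrees with −x(0)²/2 up to discretisation error. The rate of change of value along the trajectory, f·∇V = x², equals the cost, as S4.1 requires.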
             
            We now revisit value from the perspective of surprise and free-energy. If we put the random 
            fluctuations back and assume a general form (the Helmholtz decomposition) for motion: 
            f = ∇V + ∇×W, it is fairly easy to relate value and surprise (using the Fokker-Planck 
            equation, subject to ∇V ⋅ (∇×W) = 0) 
                    
                    $$
                    \begin{aligned}
                    V(\tilde{x}) &= \gamma \ln p(\tilde{x}|m) \\
                    c(\tilde{x}) &= f\cdot\nabla V + \gamma\nabla^{2} V
                    \end{aligned}
                    \tag{S4.3}
                    $$
                    
                    Here, γ > 0 encodes the amplitude of the random fluctuations (and is known as an inverse 
                    sensitivity or temperature parameter). The first equality shows that value is proportional 
                    to negative surprise, where free-energy is surprise because we know the true states. This 
                    means the value of a state is proportional to the log-probability of finding an agent m in that 
                    state. This is also the log-sojourn time, where sojourn time is the proportion of time the 
                    state is occupied by that agent. 
                    
                    In the limit of small fluctuations γ → 0, the ensemble density $p(\tilde{x}|m) = \exp(\tfrac{1}{\gamma} V(\tilde{x}))$ 
                    becomes a point mass at the minimum of the cost-function. This somewhat trivial case serves 
                    to connect optimal control theory to the equilibrium treatment that underpins the free-energy 
                    scheme. In this limit, cost is just the rate of change of value: $c(\tilde{x}) = f\cdot\nabla V = \dot{V}(\tilde{x}(t))$, as 
                    mandated by the definition of value in Equation S4.1, which is the solution to the 
                    (deterministic) Hamilton-Jacobi-Bellman equation (S4.2). 
                    
                   Crucially, Equation S4.3 also shows that peaks of the equilibrium density can only exist where 
                   cost is zero or less 
                    
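                    $$
                    \left.
                    \begin{aligned}
                    \nabla V(\tilde{x}) &= 0 \\
                    \nabla^{2} V(\tilde{x}) &\le 0
                    \end{aligned}
                    \right\} \Rightarrow c(\tilde{x}) \le 0
                    \tag{S4.4}
                    $$
                     
                    with $c(\tilde{x}) = 0$ in the limit γ → 0. 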
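
Equations S4.3 and S4.4 can be illustrated numerically. The sketch below is an assumption-laden example rather than the author's scheme: it takes a one-dimensional quadratic value-function V(x) = −x²/2 with purely gradient flow f = ∇V (no curl term), and it writes the stationary Fokker-Planck equation as d/dx(γ dp/dx − f p) = 0, i.e., with diffusion coefficient γ. It then checks that p = exp(V/γ) is stationary and evaluates the implied cost at the peak of the density. The function name `equilibrium_check` is hypothetical.

```python
import numpy as np


def equilibrium_check(gamma, half_width=6.0, n=4001):
    """Check S4.3 on a grid for the assumed value-function V(x) = -x^2 / 2."""
    x = np.linspace(-half_width, half_width, n)
    dx = x[1] - x[0]
    grad_v = -x                  # dV/dx
    lap_v = -np.ones_like(x)     # d2V/dx2
    f = grad_v                   # flow f = grad V (no curl term in one dimension)

    # Ensemble density implied by the first line of S4.3: p = exp(V / gamma), normalised.
    p = np.exp(-0.5 * x**2 / gamma)
    p /= p.sum() * dx

    # Stationary Fokker-Planck residual d/dx (gamma * dp/dx - f * p); should be near zero.
    flux = gamma * np.gradient(p, dx) - f * p
    residual = np.gradient(flux, dx)

    # Second line of S4.3: c = f . grad V + gamma * laplacian V.
    cost = f * grad_v + gamma * lap_v
    return x, p, residual, cost
```

For this example the cost works out to c(x) = x² − γ, so at the peak of the density (x = 0) the cost is −γ: non-positive, as S4.4 states, and approaching zero as γ → 0.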
                    
                   In summary, optimal control theory starts with a cost-function and solves for a value-function 
                   that  guides  the  flow  or  policy  to  minimise  expected  cost.  Conversely,  the  equilibrium 
           perspective starts with flow and derives the implicit value and cost-functions, where value is 
           proportional to negative surprise. In the last supplementary information box (S5), we show 
           how cost can define policies, without solving the (generally intractable) 
           Hamilton-Jacobi-Bellman equation. 
                Supplementary information S5 (box): Policies and cost 
                 This box describes a scheme that ensures agents are attracted to locations in state-space, 
                 using prior expectations about the motion of hidden states; $\tilde{x}(t) = [x, x']^{T} \in X$ comprising 
                 position and velocity. This formulation of how an ensemble density can be restricted to an 
                 attractive subset of state-space A ⊂ X rests on the Fokker-Planck description (see Frank 
                 2004) of how the density changes with time 
                  
                 $$
                 \dot{p}(\tilde{x}|m) = \gamma\nabla^{2}p - p\nabla\cdot f - f\cdot\nabla p
                 $$
                  
                 At equilibrium, $\dot{p}(\tilde{x}|m) = 0$ and 
                  
                 $$
                 p(\tilde{x}|m) = \frac{\gamma\nabla^{2}p - f\cdot\nabla p}{\nabla\cdot f}
                 \tag{S5.1}
                 $$
                 
                Notice that as the divergence ∇⋅ f   increases, the sojourn time (i.e., the proportion of time a 
                state is occupied) falls. Crucially, at the peaks of the ensemble density, the gradient is zero 
                and its curvature is negative, which means the divergence must be negative (from Equation 
                S5.1)  
                 
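                 $$
                 \left.
                 \begin{aligned}
                 p &> 0 \\
                 \nabla p &= 0 \\
                 \nabla^{2} p &< 0
                 \end{aligned}
                 \right\} \Rightarrow \nabla\cdot f < 0
                 \tag{S5.2}
                 $$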
                         
                 
                 This provides a simple and general mechanism to ensure peaks of the ensemble density lie 
                 in, and only in, $A \subset X$. This is assured if $\nabla\cdot f(\tilde{x}) < 0$ when $\tilde{x} \in A$ and $\nabla\cdot f(\tilde{x}) \ge 0$ 
                 otherwise. We can exploit this using the generic equations of motion 
                 
                 $$
                 f = \begin{bmatrix} x' \\ c\,x' - \partial_{x}\phi \end{bmatrix} \;\Rightarrow\; \nabla\cdot f = c
                 \tag{S5.3}
                 $$
              
              This flow describes the Newtonian motion of a unit mass in a potential energy well ϕ(x,θ), 
             where cost plays the role of negative dissipation or friction. Crucially, under this policy or flow, 
             divergence is simply cost; meaning the associated ensemble density can only have maxima 
             in regions of negative cost. This provides a means to specify attractive regions  A⊂ X  by 
             assigning them negative cost  
              
             c(x) ≤ 0: x∈A
             c(x) > 0: x∉A                                                      S5.4 
              
             Put simply, this scheme ensures that agents are expelled from high-cost regions of state-
             space and get stuck in attractive regions. 
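
The policy in S5.3 and S5.4 can be sketched with a small simulation. Everything specific here is an assumption made for illustration: the attractive set A = {|x| < 1}, a piecewise-constant cost (−2 inside A, +0.1 outside), a harmonic potential ϕ(x) = x²/2, and the omission of random fluctuations so that the flow is deterministic. The function names are hypothetical, and other choices of cost, potential or starting point may behave differently.

```python
def cost(x):
    """Hypothetical cost: negative inside A = {|x| < 1}, positive outside (cf. S5.4)."""
    return -2.0 if abs(x) < 1.0 else 0.1


def potential_gradient(x):
    """Gradient of an assumed harmonic potential well phi(x) = x^2 / 2."""
    return x


def flow(x, v):
    """S5.3: f = [x', c x' - d(phi)/dx] for position x and velocity v = x'."""
    return v, cost(x) * v - potential_gradient(x)


def divergence(x, v, eps=1e-6):
    """Central-difference divergence d(f_x)/dx + d(f_v)/dv.

    Should equal c(x) away from the boundary of A, where this cost is discontinuous.
    """
    dfx_dx = (flow(x + eps, v)[0] - flow(x - eps, v)[0]) / (2 * eps)
    dfv_dv = (flow(x, v + eps)[1] - flow(x, v - eps)[1]) / (2 * eps)
    return dfx_dx + dfv_dv


def simulate(x0, v0, duration=30.0, dt=1e-3):
    """Integrate the flow with forward Euler steps and return the final (x, v)."""
    x, v = x0, v0
    for _ in range(int(duration / dt)):
        dx, dv = flow(x, v)
        x, v = x + dx * dt, v + dv * dt
    return x, v
```

With these choices the divergence of the flow equals the cost, so positive cost acts as negative friction (the mass gains energy outside A) and negative cost acts as ordinary friction (it loses energy inside A). Starting at rest from x = 3, outside A, the simulated mass falls into A and settles near the bottom of the well.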
              
             In summary, the previous supplementary information box (S4) showed that any flow can be 
              described in terms of a scalar value-function (and vector potential W), from which an implicit 
             cost-function can be derived. In this box (S5), we have addressed the inverse problem of how 
             cost can be used to constrain flow, ensuring that it leads to attractive, low-cost states. The 
             ensuing policy or flow can be used in a generative model of flow or state-transitions to provide 
             predictions that action fulfils, under the free-energy principle. A full discussion of these and 
             related ideas will be presented in Friston et al (in preparation). 
              
              
             Reference 
              Frank TD (2004). Nonlinear Fokker-Planck Equations: Fundamentals and Applications. 
              Springer Series in Synergetics (Berlin: Springer). 
              